[HN Gopher] Leveraging Zig's Allocators
       ___________________________________________________________________
        
       Leveraging Zig's Allocators
        
       Author : PaulHoule
       Score  : 159 points
       Date   : 2024-06-14 17:34 UTC (1 days ago)
        
 (HTM) web link (www.openmymind.net)
 (TXT) w3m dump (www.openmymind.net)
        
       | lionkor wrote:
        | I love the way Zig does allocators, compared to Rust, where
        | allocation failures just panic (rolls eyes)
        
         | bombela wrote:
         | It's getting there eventually! https://doc.rust-
         | lang.org/std/boxed/struct.Box.html#method.t...
        
         | hypeatei wrote:
          | I agree, the lack of control is frustrating, but on the
          | other hand: how much software is actually going to do
          | anything useful if allocation is failing? Designing your std
          | library around the common case and then gathering input on
          | what memory-fallible APIs should look like is smarter IMO.
        
           | hansvm wrote:
           | Most problems have speed/memory/disk tradeoffs available.
           | Simple coding strategies include "if RAM then do the fast
           | thing, else do the slightly slower thing", "if RAM then
           | allocate that way, else use mmap", "if RAM then continue,
           | else notify operator without throwing away all their work",
           | ....
           | 
           | Rust was still probably right to not expose that at first
            | since memory is supposed to be fairly transparent, but Zig
           | forces the user to care about memory, and given that
           | constraint it's nearly free to also inform them of problems.
           | The stdlib is already designed (like in Rust) around
           | allocations succeeding, since those errors are just passed to
           | the caller, but Zig can immediately start collecting data
           | about how people use those capabilities. At a language level,
           | including visibility into allocation failures was IMO a good
           | idea.
        
         | kibwen wrote:
          | The Rust standard library aborts on allocation failure in
          | its basic APIs, but the language itself doesn't allocate. If
          | someone wanted to write a Zig-style library in Rust, it
          | would work just fine.
        
       | skybrian wrote:
       | Is there a reason why someone wouldn't use retain_with_limit or
       | is doing without it just an exercise?
        
         | latch wrote:
         | The inspiration for the post came from my httpz library. The
         | fallback using a FixedBufferAllocator + ArenaAllocator is used.
         | The fixed buffer is a thread local. But the arena allocators
         | belong to connections, of which there could be thousands.
         | 
          | You might have 1 fixed buffer for N (500+) ArenaAllocators
          | (but only one in use at a time). This allows you to
         | allocate a relatively large fixed buffer since you have
         | relatively few threads.
         | 
         | If you just used retain_with_limit, then you'd either have to
         | have a much smaller retained size, or you'd need a lot more
         | memory.
         | 
         | https://github.com/karlseguin/http.zig/blob/c8b04e3fef5abf32...
        
       | mikemitchelldev wrote:
       | Off topic but I wish Go had chosen to use fn rather than func
        
         | ziggy_star wrote:
         | Oh hey, are you that Mitch?
         | 
         | I literally just signed up to ask if anybody can recommend any
         | good Zig codebases to read other than Tigerbeatle. How's your
         | terminal going?
         | 
         | Edit: The rest of the posted site seems like a treasure trove
         | not just this one article. Was wondering how to get into Zig
         | and here we are. Such kismet.
         | 
         | Almost missed it so heads up for others.
        
           | mikemitchelldev wrote:
           | No, sorry not me. Though I have signed up for an invite for
           | that terminal by Mitchell Hashimoto (I think his name is).
        
           | slimsag wrote:
           | We're working on a game engine in Zig[0]
           | 
           | If you're looking for interesting Zig codebases to read, you
           | might be interested in our low-level audio input/output
           | library[1] or our module system[2] codebase - the latter
           | includes an entity component system and uses Zig's comptime
           | to a great degree to enable some interesting flexibility
           | (dependency injection, global view of the world, etc.) while
           | maintaining a great amount of type safety in an otherwise
           | dynamic system.
           | 
           | [0] https://machengine.org/
           | 
           | [1] https://github.com/hexops/mach/tree/main/src/sysaudio
           | 
           | [2] https://github.com/hexops/mach/tree/main/src/module
        
             | ziggy_star wrote:
             | Thanks!
        
         | samatman wrote:
         | I happen to agree. When I see 'fn' I "hear" function, when I
         | see 'func' I hear "funk, but misspelled".
         | 
         | Also, with four space indentation (or for Go, a four space
         | tabset), 'func' aligns right where the code begins, pushing the
         | function name off one space to the right. For 'fn' the function
         | name starts one space before the code, I find this more
         | aesthetic. Then again, the standard tabset is eight spaces, so
         | this matters less in Go.
         | 
         | It would be pretty silly to pick a language on the basis of
         | that kind of superficial window dressing, of course. But I know
         | which one I prefer.
        
         | whobre wrote:
         | I'll never understand why people care about such things.
        
           | tapirl wrote:
            | Maybe some people think "func" sounds like another word.
        
           | akira2501 wrote:
           | You don't understand personal preferences? Or you don't
           | understand the desire to share them with your peers? Or you
           | can't understand why people don't just bully themselves into
           | silence for the benefit of others?
        
         | scns wrote:
          | I like the way Kotlin did it: fun. Gives a nice alignment of
          | four with the space. And functional programming can be fun.
        
       | gizmo wrote:
       | I'm not 100% sure how Zig allocators work but it looks like the
        | arena memory is getting re-used without zeroing it? With
        | slight memory corruption, freed memory from a previous request
        | can end up leaking. That's not great.
       | 
        | Even if you don't have process isolation between workers
        | (which is generally what you want), you can still put memory
        | arenas
       | far apart in virtual memory, make use of inaccessible guard
       | pages, and take other precautions to prevent catastrophic memory
       | corruption.
        
         | eknkc wrote:
          | I guess you could place a zeroing allocator wrapper between
          | the arena and its underlying allocator. That would write
          | zeros over anything that gets freed. Arena deinit will free
          | everything allocated from the underlying allocator, so upon
          | completion of each request, used memory would be zeroed
          | before being returned to the main allocator.
          | 
          | And the handler signature would still be the same, which is
          | the whole point of this article, so, yay.
        
         | nurpax wrote:
         | The same can happen with C malloc/free too.
        
         | KerrAvon wrote:
         | That's not specific to Zig -- local heap allocators generally
         | don't zero deallocated memory -- that's a significant,
         | unnecessary performance hit.
         | 
         | If you need data to be isolated when memory is corrupt, you
         | need it to be isolated always.
        
           | 10000truths wrote:
           | memset is _the_ golden example of an easily pipelined,
           | parallelized, predictable CPU operation - any semi-modern CPU
            | couldn't ask for easier work to do. Zeroing 8 KB of memory
           | is _very_ cheap.
           | 
           | If we use a modern Xeon chip as an example, an AVX2 store has
           | a throughput of 2 instructions / cycle. Doing that 256 times
           | for 8 KB totals 128 cycles, plus a few extra cycles to
           | account for the latency of issuing the first instruction and
           | the last store to the L1 cache. With a 2 GHz clock frequency,
           | it still takes less than 70 nanoseconds. For comparison, an
           | integer divide has a worst-case latency of 90ish cycles, or
           | 45ish nanoseconds.
        
             | celrod wrote:
             | This memory is now the least recently used in the L1 cache,
             | despite being freed by the allocator, meaning it probably
             | isn't being used again.
             | 
             | If it was freed after already being removed from the L1
             | cache, then you also need to evict other L1 cache contents
             | and wait for it to be read into L1 so you can write to it.
             | 
             | 128 cycles is a generous estimate, and ignores the costs to
             | the rest of the program.
        
               | 10000truths wrote:
               | The worker is already reading/writing to the buffer
               | memory to service each incoming HTTP request, whether the
               | memory is zeroed or not. The side effects on the CPU
               | cache are insubstantial.
        
               | astrange wrote:
               | You can use non-temporal writes to avoid this, and some
               | CPUs have an instruction that zeroes a cache line. It's
               | not expensive to do this.
        
             | alexchamberlain wrote:
              | This might be a stupid question, but why isn't zeroing
              | 8KB of memory a single instruction? It must be common
              | enough to be worth making all the layers of memory (and
              | indirection) understand it.
        
               | saagarjha wrote:
               | Zeroing something that large is not typical. That said,
               | some architectures have optimized zeroing instructions,
               | such as dc zva on ARM.
        
               | astrange wrote:
               | If the memory is above the size of a page, you can tell
               | the VM to drop the page and give you a new zero filled
               | one instead.
        
               | josephg wrote:
                | For 8kb? Syscalling into the kernel, updating the
                | process's memory map and then later faulting is
               | probably slower by an order of magnitude or more compared
               | to just setting those bytes to zero.
               | 
               | Memcpy, bzero and friends are insanely fast. Practically
               | free when those bytes are in the cpu's cache already.
        
               | astrange wrote:
               | So don't syscall. Darwin has a system similar to io_uring
               | for this.
               | 
               | (But it also has a 16KB page size.)
        
             | toast0 wrote:
             | Zeroing memory is very cheap, but not zeroing it is even
             | cheaper.
             | 
             | Zeroing memory on deallocation can be important for
             | sensitive data. Otherwise, it makes more sense to zero on
              | allocation _if_ you know that it's needed because the
              | allocated structure will be used without initialization
              | _and_ the memory isn't guaranteed to be zero (most OSes
              | guarantee newly allocated memory will be zero, and have a
              | process to zero pages in the background when possible).
        
               | 10000truths wrote:
               | Sure, but in most practical applications where an HTTP
               | server is involved, zeroing the request/response buffer
               | memory is very unlikely to ever be your bottleneck. Even
               | at 10K RPS per core, your per-request CPU time budget is
               | 100 microseconds. Zeroing memory will only account for a
               | fraction of a percentage of that.
               | 
               | If you're exposing an HTTP API to clients, it's likely
               | that any response's contents will contain sensitive
               | client-specific data. If memory corruption bugs are more
               | likely than bottlenecking on zeroing out your
               | request/response buffer, then zeroing the
               | request/response buffer is a good idea, until proven
               | otherwise by benchmarks or profiling.
        
             | secondcoming wrote:
             | compilers are probably going to remove that memset.
        
               | josephg wrote:
               | In C, you can use explicit_bzero to make sure the
               | instructions aren't removed by the optimiser:
               | 
               | https://man7.org/linux/man-pages/man3/bzero.3.html
        
         | keybored wrote:
         | Deinit in O(1) seems to be a big attraction of arenas.
        
           | foota wrote:
           | O(1) is nice, but I feel like avoiding walking a bunch of
           | data structures is maybe most important.
        
             | elvircrn wrote:
             | Any papers/blogs/SO answers covering this?
        
               | saagarjha wrote:
               | What are you looking for? Bump allocators are quite
               | simple, compared to typical allocators at least.
        
               | foota wrote:
               | I don't have anything for you, but if you have some
                | normally allocated hierarchical data structures, in order to
               | free them you'll have to go through their members, chase
               | pointers, etc., to figure out the addresses to free, then
               | call free on them in sequence. That's all going to be a
               | lot more expensive than just memsetting a bunch of data
                | to zero, which you can do at the speed of your core's
                | memory bandwidth.
        
               | josephg wrote:
               | Yep. And you often don't even need to zero the data.
               | 
               | Generally, no paper or SO answer will tell you where
               | _your program_ spends its time. Learn to use profiling
               | tools, and experiment with stuff like this. Try out
               | arenas. Benchmark before and after and see what kind of
               | real performance difference it makes in your own program.
        
         | jedisct1 wrote:
         | Zig allocators can be composed, so adding zeroization would be
         | trivial.
        
         | samatman wrote:
         | I once spent an utterly baffling afternoon trying to figure out
         | why my benchmark for a reverse iteration across a rope data
         | structure in Julia was finishing way too fast. I was perf
         | tuning it, and while it would have been lovely if my
         | implementation was actually 50 times faster than reverse
         | iterating a native String type, I didn't buy it.
         | 
         | Finally figured it out: I flipped a sign in the reverse
         | iterator, so it was allocating a bunch of memory and
         | immediately hitting the margin of the Vector, and returning it
         | with most of the bytes undefined. Why didn't I catch it sooner?
         | Well, I kept running the benchmark, which allocated a reverse
         | buffer for the String version, which GC released, then I ran
         | the buggy code... and the GC picked up the recently freed
         | _correct data_ and handed it back to me! Oops.
         | 
         | Of course, if you want to avoid that risk in _Zig_ , you just
         | write a ZeroOnFreeAllocator, which zeros out your memory when
          | you free it. It's a drop-in replacement for anything which
         | needs an allocator, job done.
        
         | tapirl wrote:
          | If needed, you should zero memory when allocation succeeds,
          | instead of zeroing it after it is freed.
        
           | alexchamberlain wrote:
           | Generally, you 0 on free in secure environments to avoid
           | leaking secrets from 1 section of knowledge to the next. ie a
           | request may contain a password, which the next request should
           | not have access to.
        
         | saagarjha wrote:
         | Guard pages are not enough to prevent memory corruption across
         | requests.
        
         | hansvm wrote:
         | In my Zig servers I'm using a similar arena-based (with
         | resetting) strategy. It's not as bad as you'd imagine:
         | 
         | The current alloc implementation memsets under the hood. There
         | are ongoing discussions about the right way to remove that
         | performance overhead, but safety comes first.
         | 
         | Any sane implementation has an arena per request and per
         | connection anyway, not shared between processes. You don't have
         | bonkers aliasing bugs because the OS would have panicked before
         | handing out that memory.
         | 
         | Zig has a lot of small features designed to make memory
         | corruption an unhappy code path. I've had one corruption bug
         | out of a lot of Zig code the last few years. It was from a
         | misunderstanding of async (a classic stack pointer leak
         | disguised by a confusion over async syntax). It's not an issue
         | since async is gone from the language, and that sort of thing
         | is normally turned into a compiler error anyway as soon as
         | somebody reports it.
        
       | eknkc wrote:
       | I think the last sample needs a `fba.reset()` call in between
       | requests.
       | 
       | BTW, I used zig a lot recently and the opaque allocator system is
       | great. You can create weird wrappers and stuff.
       | 
       | For example, the standard library json parser will parse json,
       | deserialize a type that you requested (say, a struct). But it
       | needs to allocate stuff. So it creates an arena for that specific
       | operation and returns a wrapper that has a `deinit` method.
       | Calling it deinits the arena so you essentially free everything
       | in your graph of structs, arrays etc. And since it receives an
       | upstream allocator for the arena, you could pass in any
       | allocator. A fixed stack allocator if you wish to use stack
       | space, another arena, maybe jemalloc wrapper. A test allocator
       | that checks for memory leaks.. Whatever.
        
         | latch wrote:
         | fixed, thanks.
        
         | hinkley wrote:
         | When I read json I always end up holding onto values from some
         | of the keys. Sometimes the keys too if the node is abstract
         | enough.
         | 
         | I assume the receiver then has to know it has to clone all of
         | those values, yes?
         | 
          | That seems a little tricky for general code and more so for unit
         | tests.
        
           | anonymoushn wrote:
           | You can pass the json deserializer an allocator that is
           | appropriate for the lifetime of the object you want to get
           | out of it, so often no copying is required.
        
             | saagarjha wrote:
             | Right, but that means you lose the simplicity and
             | performance benefits of an arena allocator.
        
               | anonymoushn wrote:
               | I've mainly written unusual code that allocates a bunch
                | of FixedBufferAllocators up front and clears each of them
               | according to their own lifecycles. I agree that more
               | typical code would reach for a GPA or something here. If
               | you're using simdjzon, the tape and strings will be
               | allocated contiguously within a pair of buffers (and then
               | if you actually want to copy from the tape to your own
               | struct containing slices or pointers then you'll have to
               | decide where that goes), but the std json stuff will just
               | repeatedly call whatever allocator you give it.
        
           | forrestthewoods wrote:
           | Unit tests are trivial because you can _probably_ use a
           | single arena that is only reset once at the end of the test.
           | Unless the test is specifically to stress test memory in some
           | form.
           | 
           | > I assume the receiver then has to know it has to clone all
           | of those values, yes?
           | 
           | The receiver needs to understand the lifetime any which way.
           | If you parse a large JSON blob and wish to retain arbitrary
           | key/values you have to understand how long they're valid for.
           | 
           | If you're using a garbage collection language you can not
           | worry about it (you just have to worry about other things!).
           | You can think about it less if the key/values are ref-
           | counted. But for most C-like language implementations you
           | probably have to retain either the entire parsed structure or
           | clone the key/values you care about.
        
         | CyberDildonics wrote:
         | Why wouldn't this be better done with a class that takes care
         | of its memory when it goes out of scope?
        
           | anonymoushn wrote:
           | There are no automatically-invoked destructors in Zig.
        
             | saagarjha wrote:
             | Perhaps if this was added it would prove to be a better
             | solution in this case?
        
               | anonymoushn wrote:
               | It would prevent you from writing the bug at the top of
               | the thread.
               | 
               | I have stopped considering this sort of thing as a
               | potential addition to the language because the BDFL
               | doesn't like it. So realistically we must remember to
               | write reset, or defer deinit, etc. This sort of case
               | hurts a little, but people who are used to RAII will
               | experience more pain in cases where they want to return
               | the value or store it somewhere and some other code gains
               | responsibility for deinitializing it eventually.
        
           | tapirl wrote:
            | Needs vary. Some memory needs to stay alive after the
            | parse process.
        
       | ctxcode wrote:
        | These kinds of tactics work for simple examples. In real-world
        | http servers you'll retain memory across requests (caches) and
        | you'll need a way to handle blocking IO. That's why we most
        | commonly use GC'd/ownership languages for this, plus things
        | like goroutines/tokio/etc. - web devs don't want to deal with
        | memory themselves.
        
         | sph wrote:
         | Off the top of my head, I was wondering... for software like
         | web services, isn't it easier and faster to use a bump
         | allocator per request, and release the whole block at the end
         | of it? Assuming the number of concurrent requests/memory usage
         | is known and you don't expect any massive spike.
         | 
         | I am working on an actor language kernel, and was thinking of
         | adopting the same strategy, i.e. using a very naive bump
         | allocator per actor, with the idea that many actors die pretty
         | quickly so you don't have to pay for the cost of GC most of the
         | time. You can run the GC after a certain threshold of memory
         | usage.
        
           | jerf wrote:
           | Have you looked at how Erlang does memory management within
           | its processes? You definitely can "get away" with a lot of
           | things when you have actors you can reasonably expect will be
           | small scale, if you are absolutely sure their data dies with
           | them.
        
           | ctxcode wrote:
            | If you never cache any data, sure, you can use a bump
            | allocator. Otherwise it gets tricky. I haven't really
            | worked with actors, but from the looks of it, it seems like
            | they would create a lot of bottlenecks compared to
            | coroutines, and would probably throw all your bump
            | allocator performance benefits out the window. As for the
            | GC thing: you can't 'just' call a GC. Either you use a bump
            | allocator or you use a GC. Your GC can't steal objects from
            | your bump allocator. It can copy them... but then the
            | reference changes, and that's a big problem.
        
             | anonymoushn wrote:
             | I think this comment assumes that you're using one
             | allocator, but it's probably normal in Zig to use one
             | allocator for your caches, and another allocator for your
             | per-request state, with one instance of the latter sort of
             | allocator for each execution context that handles requests
             | (probably coroutines). So you can just have both, and the
             | stuff that can go in the bump allocator does, and
              | concurrent requests don't step on each other's toes.
        
           | hansvm wrote:
           | The problem _somebody_ between the hardware and your webapp
           | has to deal with is fragmentation, and it's especially
           | annoying with requests which don't evenly consume RAM. Your
           | OS can map pages around that problem, but it's cheaper to
           | have a contiguous right-sized allocation which you never re-
           | initialize.
           | 
           | Assuming the number of concurrent requests is known and they
           | have bounded memory usage (the latter is application-
            | dependent, the former can be emulated by 503-erroring excess
           | requests, or something trickier if clients handle that
           | poorly), yeah, just splay a bunch of bump allocators evenly
           | throughout RAM, and don't worry about the details. It's not
           | much faster though. The steady state for reset-arenas is that
           | they're all right-sized contiguous bump allocators. Using
           | that strategy, arenas are a negligible contribution to the
           | costs of a 200k QPS/core service.
        
         | latch wrote:
         | This example came from a real world http server. Admittedly,
         | Zig's "web dev" community is small, but we're trying :) I'm
         | sure a lot could be improved in httpz, but it's filling a gap.
        
         | samatman wrote:
         | It scales to complex examples as well. Retained memory would be
         | handled with its own allocator: for a large data structure like
         | an LRU cache, one would initialize it with a pointer to the
         | allocator, and use that internally to manage the memory.
         | 
         | Blocking (or rather, non-blocking, which is clearly what you
         | meant) IO is a different story. Zig had an async system, but it
         | had problems and got removed a couple point releases ago.
         | There's libxev[0] for evented programs, from Mitchell
         | Hashimoto. It's not mature yet but it offers a good solution to
         | single-threaded concurrency and non-blocking IO.
         | 
         | I don't think Zig is the best choice for multithreaded
         | programs, however, unless they're carefully engineered to share
         | little to no memory (using message passing, for instance).
         | You'd have to take care of locking and atomic ops manually, and
         | unlike memory bugs, Zig doesn't have a lot of built-in support
         | for catching problems with that.
         | 
         | A language with manual memory allocation isn't going to be the
         | language of choice for writing web servers, for pretty obvious
         | reasons. But for an application like squeezing the best
         | performance out of a resource-constrained environment, the
         | tradeoffs start to make sense.
         | 
         | [0]: https://github.com/mitchellh/libxev
        
         | pjmlp wrote:
         | Yes, we were creating Apache and IIS plugins 25 years ago for
         | PHP, Perl, Tcl, Coldfusion, ASP (scripting COM in VB), or C++
         | frameworks like ATLServer.
         | 
         | Very few were dealing with raw memory management in C, without
         | anything else.
         | 
         | And all of this evolved into the Web application servers, and
         | distributed computing landscape of modern times.
        
         | anonymoushn wrote:
         | You can use these patterns for per-request resources that
         | persist across some I/O calls using async if you are on an old
         | version of Zig or using zigcoro while you wait for the feature
         | to return to the language. zigcoro's API is designed to make
         | the eventual transition back to language-level async easy.
        
       | sbussard wrote:
       | Has anyone here used zig with Bazel?
        
         | hansvm wrote:
         | Not me, not yet, and it's been a few years since I've used
         | Blaze.
         | 
          | It ought to be fairly straightforward. Zig is an easy
          | dependency to either vendor or install on a given
          | system/codebase (much more painful currently if you want
          | Blaze to also build Zig itself), and at a bare minimum you
          | could just add BUILD steps for each of the artifacts defined
          | in build.zig.
         | 
         | Things get more interesting if you want to take advantage of
         | Zig's caching, especially once incremental compilation is fully
         | released. It's a fast enough compilation step that perhaps you
         | could totally ignore Zig's caching for now and wait to see how
         | that feature shapes up before making any hard decisions, but my
         | spidey senses say that'll be a nontrivial amount of work for
         | _somebody_ to integrate those two ideas.
        
       ___________________________________________________________________
       (page generated 2024-06-15 23:00 UTC)