[HN Gopher] Leveraging Zig's Allocators
___________________________________________________________________
Leveraging Zig's Allocators
Author : PaulHoule
Score : 159 points
Date : 2024-06-14 17:34 UTC (1 day ago)
(HTM) web link (www.openmymind.net)
(TXT) w3m dump (www.openmymind.net)
| lionkor wrote:
| I love the way Zig does allocators, especially when you
| compare it to Rust, where allocation failures just panic
| (rolls eyes)
| bombela wrote:
| It's getting there eventually! https://doc.rust-
| lang.org/std/boxed/struct.Box.html#method.t...
| hypeatei wrote:
| I agree, the lack of control is frustrating, but consider:
| how much software is actually going to do anything useful if
| allocation is failing? Designing your std library around the
| common case and then gathering input on what memory-fallible
| APIs should look like is smarter IMO.
| hansvm wrote:
| Most problems have speed/memory/disk tradeoffs available.
| Simple coding strategies include "if RAM then do the fast
| thing, else do the slightly slower thing", "if RAM then
| allocate that way, else use mmap", "if RAM then continue,
| else notify operator without throwing away all their work",
| ....
|
| Rust was still probably right not to expose that at first,
| since memory is supposed to be fairly transparent, but Zig
| forces the user to care about memory, and given that
| constraint it's nearly free to also inform them of problems.
| The stdlib is already designed (like in Rust) around
| allocations succeeding, since those errors are just passed to
| the caller, but Zig can immediately start collecting data
| about how people use those capabilities. At a language level,
| including visibility into allocation failures was IMO a good
| idea.
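|
| That pattern is easy to express because Zig surfaces
| OutOfMemory as an ordinary error value. A toy sketch (the
| process* functions are made-up stand-ins, and the table is
| deliberately oversized to make the tradeoff visible):
|
|     const std = @import("std");
|
|     // Stand-in slow path: constant memory, more CPU.
|     fn processStreaming(input: []const u8) u64 {
|         var sum: u64 = 0;
|         for (input) |b| sum += b;
|         return sum;
|     }
|
|     // Stand-in fast path: same byte sum, via a histogram in
|     // a large scratch table.
|     fn processWithTable(table: []u64, input: []const u8) u64 {
|         @memset(table[0..256], 0);
|         for (input) |b| table[b] += 1;
|         var sum: u64 = 0;
|         for (table[0..256], 0..) |count, i|
|             sum += count * @as(u64, i);
|         return sum;
|     }
|
|     fn process(gpa: std.mem.Allocator, input: []const u8) u64 {
|         // "if RAM then do the fast thing, else the slower
|         // thing": allocation failure is a branch, not a crash.
|         const table = gpa.alloc(u64, 1 << 20) catch
|             return processStreaming(input);
|         defer gpa.free(table);
|         return processWithTable(table, input);
|     }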
| kibwen wrote:
| The Rust standard library aborts on allocation failure using
| the basic APIs, but Rust itself doesn't allocate. If someone
| wanted to write a Zig-style library in Rust, it would work just
| fine.
| skybrian wrote:
| Is there a reason why someone wouldn't use retain_with_limit or
| is doing without it just an exercise?
| latch wrote:
| The inspiration for the post came from my httpz library, which
| uses the fallback combining a FixedBufferAllocator + an
| ArenaAllocator. The fixed buffer is thread-local, but the
| arena allocators belong to connections, of which there could
| be thousands.
|
| You might have 1 fixed buffer for N (500+) ArenaAllocators
| (with only one in use at a time). This allows you to allocate
| a relatively large fixed buffer, since you have relatively few
| threads.
|
| If you just used retain_with_limit, then you'd either have to
| have a much smaller retained size, or you'd need a lot more
| memory.
|
| https://github.com/karlseguin/http.zig/blob/c8b04e3fef5abf32...
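|
| Roughly the shape of it (a sketch, not the httpz source;
| single thread and single connection for brevity):
|
|     const std = @import("std");
|
|     threadlocal var scratch: [64 * 1024]u8 = undefined;
|
|     pub fn main() !void {
|         // One big fixed buffer per thread...
|         var fba = std.heap.FixedBufferAllocator.init(&scratch);
|
|         // ...and an arena per connection, which only pays for
|         // what overflows the fixed buffer.
|         var arena = std.heap.ArenaAllocator.init(
|             std.heap.page_allocator);
|         defer arena.deinit();
|
|         // A request allocation tries the fixed buffer first
|         // and falls back to the arena on OutOfMemory.
|         const buf = fba.allocator().alloc(u8, 512) catch
|             try arena.allocator().alloc(u8, 512);
|         _ = buf;
|
|         // Between requests: rewind both in O(1).
|         fba.reset();
|         _ = arena.reset(.retain_capacity);
|     }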
| mikemitchelldev wrote:
| Off topic, but I wish Go had chosen to use fn rather than func
| ziggy_star wrote:
| Oh hey, are you that Mitch?
|
| I literally just signed up to ask if anybody can recommend any
| good Zig codebases to read other than TigerBeetle. How's your
| terminal going?
|
| Edit: The rest of the posted site seems like a treasure trove
| not just this one article. Was wondering how to get into Zig
| and here we are. Such kismet.
|
| Almost missed it so heads up for others.
| mikemitchelldev wrote:
| No, sorry, not me. Though I have signed up for an invite for
| that terminal by Mitchell Hashimoto (I think that's his name).
| slimsag wrote:
| We're working on a game engine in Zig[0]
|
| If you're looking for interesting Zig codebases to read, you
| might be interested in our low-level audio input/output
| library[1] or our module system[2] codebase - the latter
| includes an entity component system and uses Zig's comptime
| to a great degree to enable some interesting flexibility
| (dependency injection, global view of the world, etc.) while
| maintaining a great amount of type safety in an otherwise
| dynamic system.
|
| [0] https://machengine.org/
|
| [1] https://github.com/hexops/mach/tree/main/src/sysaudio
|
| [2] https://github.com/hexops/mach/tree/main/src/module
| ziggy_star wrote:
| Thanks!
| samatman wrote:
| I happen to agree. When I see 'fn' I "hear" function, when I
| see 'func' I hear "funk, but misspelled".
|
| Also, with four space indentation (or for Go, a four space
| tabset), 'func' aligns right where the code begins, pushing the
| function name off one space to the right. With 'fn', the
| function name starts one space before the code; I find this
| more aesthetic. Then again, the standard tabset is eight
| spaces, so this matters less in Go.
|
| It would be pretty silly to pick a language on the basis of
| that kind of superficial window dressing, of course. But I know
| which one I prefer.
| whobre wrote:
| I'll never understand why people care about such things.
| tapirl wrote:
| Maybe some people think "func" sounds like another word.
| akira2501 wrote:
| You don't understand personal preferences? Or you don't
| understand the desire to share them with your peers? Or you
| can't understand why people don't just bully themselves into
| silence for the benefit of others?
| scns wrote:
| I like the way Kotlin did it: fun. Gives a nice alignment of
| four with the space. And functional programming can be fun.
| gizmo wrote:
| I'm not 100% sure how Zig allocators work, but it looks like
| the arena memory is getting re-used without zeroing it? With
| slight memory corruption, freed memory from a previous request
| can end up leaking. That's not great.
|
| Even if you don't have process isolation between workers (which
| is generally what you want), you can still put memory arenas
| far apart in virtual memory, make use of inaccessible guard
| pages, and take other precautions to prevent catastrophic
| memory corruption.
| eknkc wrote:
| I guess you could place a zeroing allocator wrapper between the
| arena and its underlying allocator. That would write zeroes
| over anything that gets freed. Arena deinit will free anything
| allocated from the underlying allocator, so upon completion of
| each request, used memory would be zeroed before being returned
| to the main allocator.
|
| And the handler signature would still be the same, which is the
| whole point of this article, so, yay.
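|
| One way that wrapper might look (a sketch assuming the
| 0.12-era std.mem.Allocator vtable):
|
|     const std = @import("std");
|
|     // Forwards everything to a child allocator, but scrubs
|     // buffers before they are freed.
|     const ZeroingAllocator = struct {
|         child: std.mem.Allocator,
|
|         pub fn allocator(self: *ZeroingAllocator) std.mem.Allocator {
|             return .{ .ptr = self, .vtable = &.{
|                 .alloc = alloc,
|                 .resize = resize,
|                 .free = free,
|             } };
|         }
|
|         fn alloc(ctx: *anyopaque, n: usize, a: u8, ra: usize) ?[*]u8 {
|             const self: *ZeroingAllocator = @ptrCast(@alignCast(ctx));
|             return self.child.rawAlloc(n, a, ra);
|         }
|
|         fn resize(ctx: *anyopaque, b: []u8, a: u8,
|                   n: usize, ra: usize) bool {
|             const self: *ZeroingAllocator = @ptrCast(@alignCast(ctx));
|             return self.child.rawResize(b, a, n, ra);
|         }
|
|         fn free(ctx: *anyopaque, b: []u8, a: u8, ra: usize) void {
|             const self: *ZeroingAllocator = @ptrCast(@alignCast(ctx));
|             @memset(b, 0); // scrub before handing memory back
|             self.child.rawFree(b, a, ra);
|         }
|     };
|
|     // Usage: back the per-request arena with the wrapper, so
|     // arena.deinit() zeroes every buffer it releases:
|     //   var z = ZeroingAllocator{ .child = std.heap.page_allocator };
|     //   var arena = std.heap.ArenaAllocator.init(z.allocator());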
| nurpax wrote:
| The same can happen with C malloc/free too.
| KerrAvon wrote:
| That's not specific to Zig -- local heap allocators generally
| don't zero deallocated memory -- that's a significant,
| unnecessary performance hit.
|
| If you need data to be isolated when memory is corrupt, you
| need it to be isolated always.
| 10000truths wrote:
| memset is _the_ golden example of an easily pipelined,
| parallelized, predictable CPU operation - any semi-modern CPU
| couldn't ask for easier work to do. Zeroing 8 KB of memory is
| _very_ cheap.
|
| If we use a modern Xeon chip as an example, an AVX2 store has
| a throughput of 2 instructions / cycle. Doing that 256 times
| for 8 KB totals 128 cycles, plus a few extra cycles to
| account for the latency of issuing the first instruction and
| the last store to the L1 cache. With a 2 GHz clock frequency,
| it still takes less than 70 nanoseconds. For comparison, an
| integer divide has a worst-case latency of 90ish cycles, or
| 45ish nanoseconds.
| celrod wrote:
| This memory is now the most recently used data in the L1
| cache, despite being freed by the allocator, meaning it
| probably isn't going to be used again soon.
|
| If it was freed after already being removed from the L1
| cache, then you also need to evict other L1 cache contents
| and wait for it to be read into L1 so you can write to it.
|
| 128 cycles is a generous estimate, and ignores the costs to
| the rest of the program.
| 10000truths wrote:
| The worker is already reading/writing to the buffer
| memory to service each incoming HTTP request, whether the
| memory is zeroed or not. The side effects on the CPU
| cache are insubstantial.
| astrange wrote:
| You can use non-temporal writes to avoid this, and some
| CPUs have an instruction that zeroes a cache line. It's
| not expensive to do this.
| alexchamberlain wrote:
| This might be a stupid question, but why isn't zeroing 8KB of
| memory a single instruction? It must be common enough to be
| worth making all the layers of memory (and indirection)
| understand it.
| saagarjha wrote:
| Zeroing something that large is not typical. That said,
| some architectures have optimized zeroing instructions,
| such as dc zva on ARM.
| astrange wrote:
| If the memory is above the size of a page, you can tell the VM
| to drop the page and give you a new zero-filled one instead.
| josephg wrote:
| For 8kb? Syscalling into the kernel, updating the process's
| memory map and then later faulting is probably slower by an
| order of magnitude or more compared to just setting those
| bytes to zero.
|
| Memcpy, bzero and friends are insanely fast. Practically free
| when those bytes are already in the CPU's cache.
| astrange wrote:
| So don't syscall. Darwin has a system similar to io_uring
| for this.
|
| (But it also has a 16KB page size.)
| toast0 wrote:
| Zeroing memory is very cheap, but not zeroing it is even
| cheaper.
|
| Zeroing memory on deallocation can be important for
| sensitive data. Otherwise, it makes more sense to zero on
| allocation _if_ you know that it's needed because the
| allocated structure will be used without initialization
| _and_ the memory isn't guaranteed to be zero (most OSes
| guarantee newly allocated memory will be zero, and have a
| process to zero pages in the background when possible).
| 10000truths wrote:
| Sure, but in most practical applications where an HTTP
| server is involved, zeroing the request/response buffer
| memory is very unlikely to ever be your bottleneck. Even
| at 10K RPS per core, your per-request CPU time budget is
| 100 microseconds. Zeroing memory will only account for a
| fraction of a percentage of that.
|
| If you're exposing an HTTP API to clients, it's likely
| that any response's contents will contain sensitive
| client-specific data. If memory corruption bugs are more
| likely than bottlenecking on zeroing out your
| request/response buffer, then zeroing the
| request/response buffer is a good idea, until proven
| otherwise by benchmarks or profiling.
| secondcoming wrote:
| Compilers are probably going to remove that memset.
| josephg wrote:
| In C, you can use explicit_bzero to make sure the
| instructions aren't removed by the optimiser:
|
| https://man7.org/linux/man-pages/man3/bzero.3.html
| keybored wrote:
| Deinit in O(1) seems to be a big attraction of arenas.
| foota wrote:
| O(1) is nice, but I feel like avoiding walking a bunch of data
| structures is maybe the most important part.
| elvircrn wrote:
| Any papers/blogs/SO answers covering this?
| saagarjha wrote:
| What are you looking for? Bump allocators are quite
| simple, compared to typical allocators at least.
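|
| The core of a bump allocator fits in a few lines (a minimal
| sketch that ignores alignment):
|
|     const Bump = struct {
|         buf: []u8,
|         end: usize = 0,
|
|         fn alloc(self: *Bump, n: usize) ?[]u8 {
|             if (self.buf.len - self.end < n) return null;
|             const s = self.buf[self.end .. self.end + n];
|             self.end += n;
|             return s;
|         }
|
|         // "Free everything" is a single store; this is the
|         // O(1) deinit mentioned upthread.
|         fn reset(self: *Bump) void {
|             self.end = 0;
|         }
|     };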
| foota wrote:
| I don't have anything for you, but if you have some normally
| allocated hierarchical data structures, then in order to free
| them you'll have to go through their members, chase pointers,
| etc., to figure out the addresses to free, then call free on
| them in sequence. That's all going to be a lot more expensive
| than just memsetting a bunch of data to zero, which you can do
| at whatever your core's memory bandwidth is.
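|
| A sketch of the contrast (hypothetical Node type):
|
|     const std = @import("std");
|
|     const Node = struct { children: []*Node };
|
|     // Per-node allocation: teardown walks the whole tree,
|     // chasing every pointer.
|     fn freeTree(gpa: std.mem.Allocator, node: *Node) void {
|         for (node.children) |child| freeTree(gpa, child);
|         gpa.free(node.children);
|         gpa.destroy(node);
|     }
|
|     // Arena allocation: teardown never looks at the nodes.
|     fn freeTreeArena(arena: *std.heap.ArenaAllocator) void {
|         arena.deinit();
|     }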
| josephg wrote:
| Yep. And you often don't even need to zero the data.
|
| Generally, no paper or SO answer will tell you where
| _your program_ spends its time. Learn to use profiling
| tools, and experiment with stuff like this. Try out
| arenas. Benchmark before and after and see what kind of
| real performance difference it makes in your own program.
| jedisct1 wrote:
| Zig allocators can be composed, so adding zeroization would be
| trivial.
| samatman wrote:
| I once spent an utterly baffling afternoon trying to figure out
| why my benchmark for a reverse iteration across a rope data
| structure in Julia was finishing way too fast. I was perf
| tuning it, and while it would have been lovely if my
| implementation was actually 50 times faster than reverse
| iterating a native String type, I didn't buy it.
|
| Finally figured it out: I flipped a sign in the reverse
| iterator, so it was allocating a bunch of memory and
| immediately hitting the margin of the Vector, and returning it
| with most of the bytes undefined. Why didn't I catch it sooner?
| Well, I kept running the benchmark, which allocated a reverse
| buffer for the String version, which the GC released; then I
| ran the buggy code... and the GC picked up the recently freed
| _correct data_ and handed it back to me! Oops.
|
| Of course, if you want to avoid that risk in _Zig_, you just
| write a ZeroOnFreeAllocator, which zeros out your memory when
| you free it. It's a drop-in replacement for anything which
| needs an allocator, job done.
| tapirl wrote:
| If needed, you should zero memory when allocation succeeds,
| instead of zeroing it after it is freed.
| alexchamberlain wrote:
| Generally, you zero on free in secure environments to avoid
| leaking secrets from one context to the next, i.e. a request
| may contain a password, which the next request should not have
| access to.
| saagarjha wrote:
| Guard pages are not enough to prevent memory corruption across
| requests.
| hansvm wrote:
| In my Zig servers I'm using a similar arena-based (with
| resetting) strategy. It's not as bad as you'd imagine:
|
| The current alloc implementation memsets under the hood. There
| are ongoing discussions about the right way to remove that
| performance overhead, but safety comes first.
|
| Any sane implementation has an arena per request and per
| connection anyway, not shared between processes. You don't have
| bonkers aliasing bugs because the OS would have panicked before
| handing out that memory.
|
| Zig has a lot of small features designed to make memory
| corruption an unhappy code path. I've had one corruption bug
| out of a lot of Zig code the last few years. It was from a
| misunderstanding of async (a classic stack pointer leak
| disguised by a confusion over async syntax). It's not an issue
| since async is gone from the language, and that sort of thing
| is normally turned into a compiler error anyway as soon as
| somebody reports it.
| eknkc wrote:
| I think the last sample needs a `fba.reset()` call in between
| requests.
|
| BTW, I've used Zig a lot recently and the opaque allocator
| system is great. You can create weird wrappers and stuff.
|
| For example, the standard library JSON parser will parse JSON
| and deserialize into a type that you requested (say, a struct).
| But it needs to allocate stuff. So it creates an arena for that
| specific operation and returns a wrapper that has a `deinit`
| method. Calling it deinits the arena, so you essentially free
| everything in your graph of structs, arrays, etc. And since it
| receives an upstream allocator for the arena, you could pass in
| any allocator: a fixed stack allocator if you wish to use stack
| space, another arena, maybe a jemalloc wrapper, a test
| allocator that checks for memory leaks... Whatever.
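|
| For instance (assuming the post-0.11 std.json API; the leak-
| checking std.testing.allocator is the upstream allocator here):
|
|     const std = @import("std");
|
|     const User = struct { id: u64, name: []const u8 };
|
|     test "one deinit frees the whole parsed graph" {
|         const parsed = try std.json.parseFromSlice(
|             User,
|             std.testing.allocator,
|             \\{"id": 1, "name": "karl"}
|         ,
|             .{},
|         );
|         // Frees the struct, the name slice, everything.
|         defer parsed.deinit();
|
|         try std.testing.expectEqualStrings("karl", parsed.value.name);
|     }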
| latch wrote:
| fixed, thanks.
| hinkley wrote:
| When I read JSON I always end up holding onto values from some
| of the keys. Sometimes the keys too, if the node is abstract
| enough.
|
| I assume the receiver then has to know it has to clone all of
| those values, yes?
|
| That seems a little tricky for general code, and more so for
| unit tests.
| anonymoushn wrote:
| You can pass the json deserializer an allocator that is
| appropriate for the lifetime of the object you want to get
| out of it, so often no copying is required.
| saagarjha wrote:
| Right, but that means you lose the simplicity and
| performance benefits of an arena allocator.
| anonymoushn wrote:
| I've mainly written unusual code that allocates a bunch of
| FixedBufferAllocators up front and clears each of them
| according to their own lifecycles. I agree that more typical
| code would reach for a GPA or something here. If you're using
| simdjzon, the tape and strings will be allocated contiguously
| within a pair of buffers (and then if you actually want to copy
| from the tape to your own struct containing slices or pointers,
| you'll have to decide where that goes), but the std json stuff
| will just repeatedly call whatever allocator you give it.
| forrestthewoods wrote:
| Unit tests are trivial because you can _probably_ use a
| single arena that is only reset once at the end of the test.
| Unless the test is specifically to stress test memory in some
| form.
|
| > I assume the receiver then has to know it has to clone all
| of those values, yes?
|
| The receiver needs to understand the lifetime any which way.
| If you parse a large JSON blob and wish to retain arbitrary
| key/values you have to understand how long they're valid for.
|
| If you're using a garbage-collected language you can avoid
| worrying about it (you just have to worry about other
| things!). You can think about it less if the key/values are
| ref-counted. But for most C-like language implementations you
| probably have to retain either the entire parsed structure or
| clone the key/values you care about.
| CyberDildonics wrote:
| Why wouldn't this be better done with a class that takes care
| of its memory when it goes out of scope?
| anonymoushn wrote:
| There are no automatically-invoked destructors in Zig.
| saagarjha wrote:
| Perhaps if this was added it would prove to be a better
| solution in this case?
| anonymoushn wrote:
| It would prevent you from writing the bug at the top of
| the thread.
|
| I have stopped considering this sort of thing as a
| potential addition to the language because the BDFL
| doesn't like it. So realistically we must remember to
| write reset, or defer deinit, etc. This sort of case
| hurts a little, but people who are used to RAII will
| experience more pain in cases where they want to return
| the value or store it somewhere and some other code gains
| responsibility for deinitializing it eventually.
| tapirl wrote:
| Needs vary. Some memory has to stay alive after the parse
| process.
| ctxcode wrote:
| These kinds of tactics work for simple examples. In real-world
| HTTP servers you'll retain memory across requests (caches) and
| you'll need a way to handle blocking IO. That's why we most
| commonly use GC'd/ownership languages for this, plus things
| like goroutines/tokio/etc. Web devs don't want to deal with
| memory themselves.
| sph wrote:
| Off the top of my head, I was wondering... for software like
| web services, isn't it easier and faster to use a bump
| allocator per request, and release the whole block at the end
| of it? Assuming the number of concurrent requests/memory usage
| is known and you don't expect any massive spike.
|
| I am working on an actor language kernel, and was thinking of
| adopting the same strategy, i.e. using a very naive bump
| allocator per actor, with the idea that many actors die pretty
| quickly so you don't have to pay for the cost of GC most of the
| time. You can run the GC after a certain threshold of memory
| usage.
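|
| In Zig terms that per-request strategy is roughly this sketch
| (nextRequest/handle are made-up stand-ins; ArenaAllocator is
| the closest std equivalent of a bump allocator):
|
|     const std = @import("std");
|
|     const Request = struct { body: []const u8 };
|
|     fn nextRequest() ?Request {
|         return null; // stand-in for a real accept loop
|     }
|
|     fn handle(gpa: std.mem.Allocator, req: Request) !void {
|         const copy = try gpa.dupe(u8, req.body);
|         _ = copy; // reclaimed wholesale by the reset below
|     }
|
|     pub fn serve(child: std.mem.Allocator) !void {
|         var arena = std.heap.ArenaAllocator.init(child);
|         defer arena.deinit();
|
|         while (nextRequest()) |req| {
|             try handle(arena.allocator(), req);
|             // Release the whole request's memory in O(1),
|             // keeping capacity for the next request.
|             _ = arena.reset(.retain_capacity);
|         }
|     }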
| jerf wrote:
| Have you looked at how Erlang does memory management within
| its processes? You definitely can "get away" with a lot of
| things when you have actors you can reasonably expect will be
| small scale, if you are absolutely sure their data dies with
| them.
| ctxcode wrote:
| If you never cache any data, sure, you can use a bump
| allocator. Otherwise it gets tricky. I haven't worked with
| actors really, but from the looks of it, it seems like they
| would create a lot of bottlenecks compared to coroutines, and
| that would probably throw all your bump allocator performance
| benefits out the window. As for the GC thing: you can't 'just'
| call a GC. Either you use a bump allocator or you use a GC.
| Your GC can't steal objects from your bump allocator. It can
| copy them... but then the references change and that's a big
| problem.
| anonymoushn wrote:
| I think this comment assumes that you're using one
| allocator, but it's probably normal in Zig to use one
| allocator for your caches, and another allocator for your
| per-request state, with one instance of the latter sort of
| allocator for each execution context that handles requests
| (probably coroutines). So you can just have both, and the
| stuff that can go in the bump allocator does, and
| concurrent requests don't step on each other's toes.
| hansvm wrote:
| The problem _somebody_ between the hardware and your webapp
| has to deal with is fragmentation, and it's especially
| annoying with requests which don't evenly consume RAM. Your
| OS can map pages around that problem, but it's cheaper to
| have a contiguous right-sized allocation which you never re-
| initialize.
|
| Assuming the number of concurrent requests is known and they
| have bounded memory usage (the latter is application-
| dependent, the former can be emulated by 503-erroring excess
| requests, or something trickier if clients handle that
| poorly), yeah, just splay a bunch of bump allocators evenly
| throughout RAM, and don't worry about the details. It's not
| much faster though. The steady state for reset-arenas is that
| they're all right-sized contiguous bump allocators. Using
| that strategy, arenas are a negligible contribution to the
| costs of a 200k QPS/core service.
| latch wrote:
| This example came from a real world http server. Admittedly,
| Zig's "web dev" community is small, but we're trying :) I'm
| sure a lot could be improved in httpz, but it's filling a gap.
| samatman wrote:
| It scales to complex examples as well. Retained memory would be
| handled with its own allocator: for a large data structure like
| an LRU cache, one would initialize it with a pointer to the
| allocator, and use that internally to manage the memory.
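|
| Concretely, something like this sketch (eviction and deinit
| elided; an Unmanaged map stores no allocator, so the cache
| passes its own on each call):
|
|     const std = @import("std");
|
|     const Cache = struct {
|         // Long-lived allocator, not the per-request arena.
|         gpa: std.mem.Allocator,
|         map: std.StringHashMapUnmanaged([]const u8) = .{},
|
|         fn put(self: *Cache, key: []const u8, val: []const u8) !void {
|             const k = try self.gpa.dupe(u8, key);
|             const v = try self.gpa.dupe(u8, val);
|             try self.map.put(self.gpa, k, v);
|         }
|     };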
|
| Blocking (or rather, non-blocking, which is clearly what you
| meant) IO is a different story. Zig had an async system, but it
| had problems and got removed a couple point releases ago.
| There's libxev[0] for evented programs, from Mitchell
| Hashimoto. It's not mature yet but it offers a good solution to
| single-threaded concurrency and non-blocking IO.
|
| I don't think Zig is the best choice for multithreaded
| programs, however, unless they're carefully engineered to share
| little to no memory (using message passing, for instance).
| You'd have to take care of locking and atomic ops manually, and
| unlike memory bugs, Zig doesn't have a lot of built-in support
| for catching problems with that.
|
| A language with manual memory allocation isn't going to be the
| language of choice for writing web servers, for pretty obvious
| reasons. But for an application like squeezing the best
| performance out of a resource-constrained environment, the
| tradeoffs start to make sense.
|
| [0]: https://github.com/mitchellh/libxev
| pjmlp wrote:
| Yes, we were creating Apache and IIS plugins 25 years ago for
| PHP, Perl, Tcl, ColdFusion, ASP (scripting COM in VB), or C++
| frameworks like ATLServer.
|
| Very few were dealing with raw memory management in C, without
| anything else.
|
| And all of this evolved into the Web application servers, and
| distributed computing landscape of modern times.
| anonymoushn wrote:
| You can use these patterns for per-request resources that
| persist across some I/O calls using async if you are on an old
| version of Zig or using zigcoro while you wait for the feature
| to return to the language. zigcoro's API is designed to make
| the eventual transition back to language-level async easy.
| sbussard wrote:
| Has anyone here used zig with Bazel?
| hansvm wrote:
| Not me, not yet, and it's been a few years since I've used
| Blaze.
|
| It ought to be fairly straightforward. Zig is an easy
| dependency to either vendor or install on a given system/code-
| base (much more painful currently if you want Blaze to also
| build Zig itself), and at a bare minimum you could just add
| BUILD steps for each of the artifacts defined in build.zig.
|
| Things get more interesting if you want to take advantage of
| Zig's caching, especially once incremental compilation is fully
| released. It's a fast enough compilation step that perhaps you
| could totally ignore Zig's caching for now and wait to see how
| that feature shapes up before making any hard decisions, but my
| spidey senses say that'll be a nontrivial amount of work for
| _somebody_ to integrate those two ideas.
___________________________________________________________________
(page generated 2024-06-15 23:00 UTC)