[HN Gopher] Writing a simple pool allocator in C
       ___________________________________________________________________
        
       Writing a simple pool allocator in C
        
       Author : 8dcc
       Score  : 254 points
       Date   : 2025-01-05 23:07 UTC (3 days ago)
        
 (HTM) web link (8dcc.github.io)
 (TXT) w3m dump (8dcc.github.io)
        
       | pajko wrote:
       | ThreadX has pool allocators: https://github.com/eclipse-
       | threadx/threadx/blob/master/commo...
        
         | kazinator wrote:
         | Everyone and his dog who has programmed C for a few decades has
         | several as well. :)
        
       | liontwist wrote:
        | The pool struct is two pointers, why are you allocating it
        | with malloc?
        
         | o11c wrote:
         | Some people are allergic to `main() { Foo foo; foo_init(&foo);
         | }`
         | 
         | Admittedly, sometimes for good reason - it locks in the ABI.
        
           | MathMonkeyMan wrote:
           | Is that why some structs have "char reserved[16]" data
           | members? I think I saw this in NGINX, though that might have
           | been to allow module compatibility between the open source
           | offering and the paid version.
        
             | o11c wrote:
             | Yes, that slightly improves ABI flexibility, though the
             | fact that callers can still access fields limits that.
             | 
             | An alternative is to make the _only_ member `char
             | opaque[16]` (IIRC some locale-related thing in glibc does
             | this, also I think libpng went through a transition of some
             | sort related to this?), but calculating the size can be
              | difficult outside of trivial cases since you can't use
             | sizeof/offsetof.
        
           | liontwist wrote:
           | ABI? Better make it an anonymous struct too.
           | 
           | It's a performance oriented custom memory allocator.
        
           | kazinator wrote:
           | Mainly just the size and alignment. If you make the structure
           | oversized, and with strict alignment, it can be future
           | proofed. Old binary clients will provide a structure big
           | enough and aligned enough for the new version.
           | 
           | The dynamic constructor API can allocate the structure
           | exactly sized without the padding.
        
         | unwind wrote:
         | Next question: why are there two calls to `malloc()`, one for
         | the Pool structure and one for the Chunks?
         | 
         | It is trivial to co-allocate them which removes the risk of
         | memory fragmentation and is just generally a good idea if
         | you're chasing performance.
         | 
         | The answer might be "for clarity/simplicity" which I guess is
         | fine in an informative article, but it could at least be
         | highlighted in the text which I didn't see.
        
           | 8dcc wrote:
           | I will also take note of this, you are right.
        
           | jdnxbdn wrote:
            | It is certainly not trivial, and would add little to the
            | topic.
        
           | kevin_thibedeau wrote:
           | He's targeting ANSI C, C89. Flexible array members are not
           | supported officially until C99. 26 years later, it's time to
           | stop clinging to outdated standards and crusty tooling. Even
           | Linux had to read the room and adopt C11.
           | 
           | A C11 implementation could go one step further and use
           | _Alignas(max_align_t) to keep the pool array aligned with no
           | manual effort. The double allocation does this implicitly.
        
             | liontwist wrote:
             | The pool needs to align every allocation anyway.
        
         | lyu07282 wrote:
         | Couldn't this also just syscall mmap directly? I mean malloc
         | itself is a memory allocator it feels a bit strange to use it
         | in an allocator implementation, but perhaps I'm missing
         | something.
        
       | keyle wrote:
        | The parent article is quite interesting and slightly
        | different, being C++ [1]. There is also a post about writing
        | a memory allocator [2].
       | 
       | [1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-
       | allocato...
       | 
       | [2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-
       | alloca...
        
       | 10000truths wrote:
       | I recommend against using linked lists for bookkeeping the free
       | blocks. It seems to be the data structure that every malloc/free
       | implementation reaches for, and I don't know why - the slowness
       | of pointer-chasing makes it terrible for almost any real-world
       | use case. A balanced tree would be a much better idea, given that
       | all the essential operations would take O(log n) time instead of
       | O(n). Even if one insists on a linear search, a bitset is much
       | more cache friendly than pointer-chasing and it can trivially
       | benefit from SIMD optimizations.
        
         | ben-schaaf wrote:
         | I don't think you understand how the allocator in the article
         | works. Allocating and freeing are already O(1), creating and
         | closing the allocator are necessarily O(n). There is no search
         | being done here.
        
         | o11c wrote:
         | For the `Chunk` list, this isn't one of the cases where linked
         | lists are harmful. Each use _only_ touches the top of the
         | stack, never iterates. Also, linked lists are much easier to
         | make thread-safe.
         | 
         | For the `LinkedPtr` list, the bad case is only hit during the
         | destruction of the pool, and then only if you allocate a lot of
         | memory. And given the overhead of deallocation I'm not sure the
         | array approach would measurably help.
         | 
         | I don't see anywhere a binary tree search would be useful here,
         | since there are no loops used for lookup (on allocation they're
         | pushed in order, but when freed chunks are pushed in arbitrary
         | order; this does mean no double-free protection).
        
           | harrison_clarke wrote:
           | why would you need to iterate on destruction? you free the
           | whole pool at once, since you allocate it all at once
        
         | elchananHaas wrote:
          | The reason linked lists are used is that, for large enough
          | allocations, there is no overhead. You use the space the
          | application isn't using. In addition, if all allocations
          | are the same size it is O(1): you just look at the head of
          | the list.
          | 
          | More sophisticated strategies bucket allocations by size;
          | this has fixed overhead. You can also use balanced trees
          | for more memory efficiency, but this is slower.
          | 
          | For small allocations (8 bytes) that are too small to
          | contain pointers, allocators will allocate a block and use
          | bitsets.
        
         | JJOmeo wrote:
         | Pool allocators don't walk the list or search anything though.
         | All interactions are only at the list head and O(1), as all
         | free nodes are just that, free and equal.
        
           | ceving wrote:
            | Then it might be a misnomer. Calling it "stack" instead of
           | "list" might be better.
        
             | alextingle wrote:
             | That is a fair comment.
        
               | cb321 wrote:
               | While the most precise naming might be "a singly linked
               | list representation of a stack", "free list"
               | (https://en.wikipedia.org/wiki/Free_list) is an ancient
                | term of art for this problem, so much so that Wikipedia
               | (at least the version today) even suggests "freelist" all
               | one word as a spelling.
               | 
               | The initial linking together of all free space (also
               | cache friendly, btw) is often called "threading the free
               | list", although this terminology is less standard than
               | "free list" itself.
        
         | MathMonkeyMan wrote:
         | Which pool operation is made more costly by the linked list?
         | Both allocate and free are constant time, and trivial too.
         | 
         | The only thing that I can imagine being faster is a bump
         | allocator, but that precludes free.
        
         | kazinator wrote:
          | Suppose the implementation did happen to search through the free
         | blocks. Suppose you put them into an array instead of a linked
         | list. They can't actually be in the array, you see? The blocks
         | are in the pool heap wherever they happen to be. So the array
         | has to point to them. As you walk through the array you have to
          | dereference the pointers. The only way it's better is that
         | they are not dependent loads. But you have this array to manage
         | now.
        
       | ch33zer wrote:
        | Some interesting things that could be added to this:
        | 
        | - thread safety (not too hard to add on top of your linked
        |   list)
        | - variable sized allocations, ideally something more akin to
        |   malloc; could be done wastefully by rounding up to the
        |   nearest chunk size
        | - (as mentioned in the article) variable chunk sizes
        | - zeroing of memory before handing it back to users
        | - double free detection, or even better, full ASan support
        
         | 8dcc wrote:
         | I don't currently have much experience with thread safety, but
         | it's something I should look into. Thank you.
        
       | kazinator wrote:
       | If you write your own allocator in C, do yourself a favor and use
       | the valgrind API inside it. Its use can be conditional so you can
       | "compile it out". Or should I say it has to be, so you can build
       | the code in an environment where there's no valgrind API.
        
         | gopalv wrote:
          | +1 - this isn't even that hard: it's just including a
          | single header and annotating allocations/frees, with some
          | redzones around the actual allocation to know when the
          | user code is writing into the pool's pointer areas.
         | 
         | https://developers.redhat.com/articles/2022/03/23/use-valgri...
        
           | hooli_gan wrote:
           | https://en.wikipedia.org/wiki/Red_zone_(computing)
        
           | dsp_person wrote:
           | oof love on this page how scrolling hijacks the back button
        
         | 8dcc wrote:
         | I have never heard about this (using valgrind with a "non-
         | standard allocator"), but it looks really interesting,
          | especially for other projects I am working on. I will definitely
         | look into it.
        
           | mananaysiempre wrote:
           | You can also integrate with AddressSanitizer to some extent,
           | look into[1] ASAN_{UN,}POISON_MEMORY_REGION from
           | <sanitizer/asan_interface.h> or the underlying intrinsics
           | __asan_{un,}poison_memory_region.
           | 
           | [1] https://github.com/google/sanitizers/wiki/AddressSanitize
           | rMa...
        
           | kazinator wrote:
           | I integrated the TXR Lisp garbage collector with valgrind.
           | That came in handy on a number of occasions.
           | 
           | It's not always the case that you want special behavior for
           | the garbage collector to assist with debugging, but also
           | this: if you want to use Valgrind to debug anything that does
           | _not_ have to do with the garbage collector, you want all the
           | false positives generated by it out of your way.
           | 
           | Some of the scanning that a conservative garbage collector
           | has to do looks like access errors to Valgrind. By using the
           | Valgrind API, you can grant yourself access to that memory,
           | and then lock it again when done.
           | 
           | That aspect doesn't necessarily apply to a pool allocator
           | being described in this article; that should not be
           | triggering any false positives.
        
             | 8dcc wrote:
             | (In case you missed my other post, I ended up adding
             | valgrind support in the article and in the 'libpool'
             | project.)
             | 
             | > I integrated the TXR Lisp garbage collector with
             | valgrind.
             | 
             | That's funny, because when I said "for other projects I am
             | working on", I specifically meant a Lisp interpreter which
             | uses a memory pool and garbage collection (I am still
             | working on implementing this last part, anyway). I am very
             | happy about the fact that I can continue using valgrind in
             | this project.
        
         | kragen wrote:
         | Thanks for the tip!
        
         | tavianator wrote:
         | I've replaced Valgrind with sanitizers for most of my workflow.
         | Using the sanitizer APIs in your custom allocator is also easy:
         | https://github.com/tavianator/bfs/blob/main/src/sanity.h
         | https://github.com/tavianator/bfs/blob/31dffd6441ba80ea998ce...
        
         | lsllc wrote:
          | The Address Sanitizer (ASAN) in both clang and gcc is
          | excellent:
         | 
         | https://github.com/google/sanitizers/wiki/addresssanitizer
         | 
         | I believe to integrate a custom allocator with ASAN, you need
         | to look at the memory "poisoning" mechanisms:
         | 
         | https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
        
       | accelbred wrote:
       | Note that using this will likely violate strict aliasing due to
       | using the void pointer returned as one type, freeing it, and then
       | getting that same chunk again and using it as another type.
       | You'll probably want to compile with -fno-strict-aliasing to be
       | safe.
       | 
       | A good reference on strict aliasing:
       | https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
        
         | leni536 wrote:
         | That's what allocators do. If C's object model doesn't allow
         | users of the language to write their own allocators then that
         | object model is broken.
         | 
          | C++ has relatively recent fixes to allow allocators to
          | work; they require calls to std::launder.
        
           | dwattttt wrote:
           | I understand the C standard hides such actions behind a wall
           | of "this is the allocator", and expected the compiler authors
           | to also be the allocator authors, allowing them to know
           | when/how they can break such rules (in the context of their
           | own compiler)
        
             | unwind wrote:
             | No, allocators are not magic (in that regard) in C. There
             | is nothing out of the ordinary going on with an allocator,
             | the parent comment is simply mistaken (as pointed out by
             | another answer).
        
               | dwattttt wrote:
               | Ah, I see that it's because the char type never violates
               | Strict Aliasing. I was wondering how you could define a
               | type such as Chunk, yet hand out a pointer to it to a
               | user who will cast it to some other type.
        
               | SleepyMyroslav wrote:
                | Well, the pool code fails to use the char* type to
                | write inside the block of memory.
               | 
                | If you look at the 'pool_free' code you can see that
                | it receives the used block from the user as 'ptr',
                | then casts it to 'Chunk pointer' and writes into the
                | Chunk member of type 'Chunk pointer'. I had to change
                | that line to memcpy when I did it last time, around
                | 15 years ago, in C++ with GCC 4.something. In short,
                | if you are writing your own allocators you need
                | either to disable strict aliasing like Linux does, or
                | do the 'memcpy' dance when you access memory as
                | 'internal allocator structures' right after it was
                | used as a user type. When it happened to me, a write
                | was reordered and I observed code that writes into
                | 'Chunk* next' executed first, and the write of zeros
                | into the user type second.
        
               | uecker wrote:
               | Note that C++ has different rules than C. C has type
               | changing stores that do not require memcpy for allocated
               | storage (i.e. no declared type). Older compilers have
               | plenty of bugs related to TBAA though. Newer versions of
               | GCC should be ok.
        
           | pjmlp wrote:
           | Which is why pointer provenance is an issue as well.
        
         | alextingle wrote:
         | There's no reason an allocator should ever trip over strict
         | aliasing rules.
         | 
         | malloc() returns an address. The user code writes a value, and
         | then reads a value of the same type. No problem.
         | 
         | Later, malloc() returns the same address. The user code writes
         | a value of another type then reads a value of that same type.
         | Again, no problem. There's no aliasing going on here.
         | 
         | The act of writing a value of a different type tells the
         | compiler that the lifetime of the previous object has ended.
         | There's no special magic required.
         | 
         | Strict aliasing is about situations where we write a value of
         | one type, and then attempt to read a value of a different type
         | from the same location. If you want to do that, then you have
         | to be extremely cautious. But memory allocators don't do that,
         | so it's not an issue.
         | 
         | (Obviously just talking about C here, or POD in C++ terms. If
         | you are dealing with C++ objects with destructors, then you
         | have an extra layer of complexity to deal with, but again
         | that's nothing to do with aliasing.)
        
           | clairebyte wrote:
           | >The act of writing a value of a different type tells the
           | compiler that the lifetime of the previous object has ended.
           | 
            | afaik only memcpy has that magic property, so I think
            | parent is almost correct.
            | 
            |     void *p = malloc(n);
            |     *(int *)p = 42; // ok, *p is now an int
            |     //*(float *)p = 3.14f; // I think this is not
            |     //   allowed: p points to an int object, and regular
            |     //   stores do not change the effective type
            |     float x = 3.14f;
            |     memcpy(p, &x, sizeof(float)); // but this is fine,
            |     //   *p now has effective type float
            | 
            | So in pool_new:
            | 
            |     pool->chunk_arr[i].next = &pool->chunk_arr[i + 1];
            | 
            | This sets the effective type of the chunk block to
            | 'Chunk'.
            | 
            | Later in pool_alloc:
            | 
            |     Chunk* result = pool->free_chunk;
            |     ...
            |     return result;
            | 
            | result has effective type 'Chunk'.
            | 
            | In user code:
            | 
            |     int *x = pool_alloc();
            |     *x = 42; // aliasing violation: *x has effective
            |     //   type 'Chunk' but we access it as an int
            | 
            | User code would need to look like this:
            | 
            |     int *x = pool_alloc();
            |     memcpy(x, &(int){0}, sizeof(int)); // establish the
            |     //   new effective type as 'int'
            |     // now we can do
            |     *x = 42;
        
             | bigpingo wrote:
              | And this is why type based alias analysis (TBAA) is
              | insane and why projects like Linux compile with
              | -fno-strict-aliasing.
             | 
             | C should issue a defect report and get rid of that nonsense
             | from the standard.
        
               | astrange wrote:
               | C doesn't have "alias analysis" in the standard. It has
               | an (informally specified) memory model which has "memory
               | objects" which have a single type, which means treating
               | them as a different type is undefined behavior.
               | 
               | This enables security analysis like valgrind/ASan and
               | secure hardware like MTE/CHERI so it's very important and
               | you can't get rid of it.
               | 
               | However, it's not possible to implement malloc() in C
               | because malloc() is defined as returning new "memory
               | objects" and there is no C operation which creates
               | "memory objects" except malloc() itself. So it only works
               | as long as you can't see into the implementation, or if
               | the compiler gives you special forgiveness somehow.
               | 
               | C++ has such an operation called "placement new", so you
               | want something like that.
        
               | kevin_thibedeau wrote:
               | You can definitely implement malloc in C. It does nothing
               | special in its most basic form but cough up void pointers
               | into its own arena.
               | 
               | It gets complicated when you have virtual memory and an
               | OS involved but even then you can override the system
               | malloc with a simple implementation that allocates from a
               | large static array.
        
             | oguz-ismail wrote:
             | > aliasing violation, *x has effective type 'Chunk'
             | 
             | This doesn't make any sense. How do you know its effective
             | type if you don't have access to the definition of
             | `pool_alloc()'?
        
               | clairebyte wrote:
                | If you can guarantee it's always compiled in a
                | separate TU and never inlined, sure, that might be a
                | practical way to 'solve' this issue, but if you then
                | do some LTO (or a unity build or something) the
                | compiler might suddenly
               | break your code. Another way is to add an inline asm
               | block with "memory" clobber to escape the pointer so the
               | optimizer can't destroy your code.
               | 
               | It's really quite ridiculous that compiler implementers
               | have managed to overzealously nitpick the C standard so
               | that you can't implement a memory allocator in C.
        
               | astrange wrote:
               | > It's really quite ridiculous that compiler implementers
               | have managed to overzealously nitpick the C standard so
               | that you can't implement a memory allocator in C.
               | 
               | This is good because it's also what gives you valgrind
               | and CHERI. Take one away and you can't have the other.
               | 
               | (Because if you define all undefined behavior, then
               | programs will rely on it and you can't assume it's an
               | error anymore.)
        
             | uecker wrote:
             | Any type-changing store to allocated storage has this
             | property in C.
        
             | astrange wrote:
             | There are other issues besides changing the memory type.
             | For instance, C has those rules about out of bounds
             | pointers being undefined, but you can't implement that - if
             | you return part of the pool and someone calculates an out
             | of bounds address they're getting a valid address to the
             | rest of the pool. That's why you can't implement malloc()
             | in C.
             | 
             | (The difference here is that system malloc() works with
             | valgrind, -fbounds-safety, theoretical secure hardware with
             | bounds checking etc., and this one doesn't.)
        
           | gpderetta wrote:
           | > The act of writing a value of a different type tells the
           | compiler that the lifetime of the previous object has ended.
           | There's no special magic required.
           | 
           | I think strictly speaking in C this is only true for
           | anonymous memory, i.e. memory you got from some allocator-
           | like function. So if your custom allocator gets its memory
            | from malloc (or sbrk, or mmap), everything is fine; but
            | if, for example, you are allocating from a static char
            | array, it is formally UB.
           | 
           | In C++ you can use placement new to change the type of
           | anything.
        
             | uecker wrote:
             | Yes, type-changing stores are guaranteed by the C standard
             | to work only for allocated storage. In practice no compiler
             | makes a difference between declared and allocated storage.
             | A bigger problem is that Clang does not implement the C
             | rules as specified..
        
       | 8fingerlouie wrote:
        | A long time ago, in a galaxy far far away, I wrote something
       | similar for an embedded platform. I was implementing a WAP
       | Browser at the time (yes, WAP,
       | https://en.wikipedia.org/wiki/Wireless_Application_Protocol), and
       | we needed dynamic memory allocation in an environment where
       | everything was static due to realtime constraints.
       | 
        | So I ended up with this: https://github.com/jinie/memmgr
        
       | guardian5x wrote:
        | The Latin Modern font on this website does not look so good
        | in Chrome. Is it because of the font size, or is it just me?
        
         | johnisgood wrote:
         | I was going to say this. Looks awful to me as well. It is the
         | font-family.
        
         | dmonitor wrote:
         | Sounds like a Windows issue based on other people in the
          | thread. I'm on Firefox with the same issue. Removing Latin
          | Modern from the CSS file made it much more readable for me.
        
       | 9029 wrote:
       | Another (admittedly less portable) way to solve the resizing
       | problem is to reserve a large virtual memory space using linux
       | mmap with the PROT_NONE flag and then commit pages as needed
        | using mprotect with PROT_READ|PROT_WRITE flags. I believe
        | Windows' VirtualAlloc/VirtualProtect also allows this.
        
         | jeffbee wrote:
         | You don't need to do anything with the flags. You can just mmap
         | a huge virtual area and the pages will be realized whenever you
         | first touch them.
        
           | 9029 wrote:
           | Today I learned, thanks
        
       | cb321 wrote:
       | One aspect people don't tend to mention (and TFA does not because
       | it has "general purpose" as a scope) about doing your own
       | allocation arenas is that because a "pointer" is just the name
       | (really number) of an allocated region, this also affords
       | flexibility in the width of those numbers aka "pointer size" that
       | decides your "address" space.
       | 
       | So, for example, if you are willing to limit yourself to 64
       | KiNodes of equal sized objects, a pool like
       | https://github.com/c-blake/bst/blob/main/POOL.c can use an
       | unsigned short 2-byte pointer. This small address space
       | specialization can give a full 4x space overhead optimization
       | over modern 64-bit pointers.
       | 
       | You could even use 8x less with 1-byte pointers if 256 nodes is
       | enough or use bit fields or equivalents for even more fine-
       | grained address space size selection. Admittedly, linked
       | structures are usually less interesting for very small address
       | space sizes, but this is sometimes useful. People may complain
       | about arbitrary limits that relate to this kind of optimization,
       | but the linked part _could_ be  "small by design" without
       | compromising overall scale { such as for structure _within_ a
       | B-tree node or other multi-level indexing setups }.
       | 
       | In any event, I think it is useful to highlight pointer size
       | arbitrariness to junior engineers. It is often learned once &
       | immediately forgotten behind an onslaught of general purpose-ness
       | (& pointer type systems/syntax). Relocation ideas also relate.
       | E.g., if all the pointers are implicitly indices into a (small?)
       | array, and all the code needs a "base address" of that array to
       | get a VMem address, then you can relocate the whole bundle of
       | objects pretty easily changing only the base address.
        
         | 8dcc wrote:
         | Yes, this is true. I didn't use that in the code because I
         | thought it would make it less readable, though. I might mention
         | it somewhere, thank you.
        
           | cb321 wrote:
           | In many prog.lang's, it absolutely makes things less
           | readable. This is not an accident, but because the PL itself
           | specifically tries to syntactically support pointer types
           | (i.e. to literally aid readability). Most PLs do so only for
           | general purpose VMem pointers with user-defined pointers
           | being afterthoughts (at best). You probably need (at least!)
           | operator overloading to get the same level of readability as
           | "PL-native" pointers { though "readability" is a bit
           | subjective and powerful PL features can be divisive; Scare
           | "quotes" for interpretation hazard :-) }.
           | 
           | Anyway, you probably knew all that & I don't mean to suggest
           | otherwise, but it made sense to write since HN loves PL talk.
        
         | kps wrote:
          | 'Handles are the better pointers' [1] is a related writeup
          | that has been discussed here before [2].
          | 
          | [1] https://floooh.github.io/2018/06/17/handles-vs-pointers.html
          | 
          | [2] https://news.ycombinator.com/item?id=36419739
        
       | matheusmoreira wrote:
       | This is really informative. I wonder if there is a better
       | solution for the reallocation problem, one that preserves
       | pointers.
        
       | jll29 wrote:
       | This code is very beautifully written, thanks for sharing.
       | 
       | However, you should consult the book "C: Interfaces and
       | Implementations" by Dave Hanson, which has a library called
       | "Arena" that is very similar to your "Pool"s, and it shows a few
       | more tricks to make the code safer and easier to read (Chapters
       | 5+6, 6 in particular).
       | 
       | D. Hanson's book (written in the 'literary programming' style
       | invented by Donal Knuth) also demonstrates having debugging and
       | production implementations for the same C API to catch memory
        | errors, and his book is from 1997:
        | 
        |     > Copyright (c) 1997 by David R. Hanson
       | 
        | (He used to be a SE professor at Princeton before getting hired by
       | Microsoft, I believe.)
        
         | 8dcc wrote:
         | Thank you so much! I will definitely check it out. I also
         | planned on writing an article about arena allocators soon.
        
         | kragen wrote:
         | I think you mean "literate programming".
         | 
         | The arena allocator in Hanson's chapter 6 is an arena
         | allocator, not a "pool allocator" in the sense that 8dcc means,
         | which is to say, a constant-time fixed-size-block allocator. It
         | seems that you had the same reading comprehension problem that
         | I did.
         | 
         | The difference is that Hanson's arena allocator can make
         | allocations of any size (his Arena_alloc takes an nbytes
         | argument) and cannot deallocate individual allocations, only
         | the entire arena (his Arena_free takes only an arena pointer as
         | a parameter). A fixed-size-block allocator like the one we are
         | discussing here can only make allocations of a fixed size, but
         | can recycle them individually, not only all at once.
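          | 
          | To make the contrast concrete, a fixed-size-block pool is
          | roughly the following (a hedged sketch with illustrative
          | names, not 8dcc's actual API; the free list is threaded
          | through the unused blocks themselves):

```c
#include <assert.h>
#include <stdlib.h>

struct pool {
    void *free_list;  /* head of a list threaded through the free blocks */
    void *chunks;     /* backing storage */
};

static struct pool *pool_new(size_t block_size, size_t nblocks) {
    if (nblocks == 0)
        return NULL;
    if (block_size < sizeof(void *))
        block_size = sizeof(void *);          /* a block must hold a pointer */
    struct pool *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->chunks = malloc(block_size * nblocks);
    if (!p->chunks) { free(p); return NULL; }
    char *c = p->chunks;
    for (size_t i = 0; i + 1 < nblocks; i++)  /* link block i -> block i+1 */
        *(void **)(c + i * block_size) = c + (i + 1) * block_size;
    *(void **)(c + (nblocks - 1) * block_size) = NULL;
    p->free_list = c;
    return p;
}

/* O(1), fixed size: pop the head of the free list. */
static void *pool_alloc(struct pool *p) {
    void *blk = p->free_list;
    if (blk) p->free_list = *(void **)blk;
    return blk;
}

/* O(1): unlike Arena_free, this recycles ONE block, not the whole pool. */
static void pool_free(struct pool *p, void *blk) {
    *(void **)blk = p->free_list;
    p->free_list = blk;
}

static void pool_close(struct pool *p) { free(p->chunks); free(p); }
```

          | Hanson's arena, by contrast, hands out any nbytes by bumping a
          | pointer, and only gives memory back when the whole arena is
          | freed.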
         | 
         | (If Hanson's name sounds familiar, it might be because you've
         | read the lcc book.)
         | 
         | It may be worth noting that Donal _d_ Knuth also invented the
         | font used in 8dcc 's article.
        
           | actionfromafar wrote:
           | Honestly, "literary programming" does better explain how
           | Knuth writes code.
        
             | kragen wrote:
             | Yes, but that name lacks the benefit of implicitly
             | describing other approaches to programming as "illiterate".
        
               | actionfromafar wrote:
               | Haha, never thought about that!
        
               | kragen wrote:
               | Knuth did.
        
               | kevin_thibedeau wrote:
               | Reading the Metafont source is pretty challenging because
               | of all the transclusions. There's a reason why programs
               | aren't written like journal papers.
        
               | kragen wrote:
               | I think we can probably do better than METAFONT with the
               | benefit of hindsight. Unit tests would help with
               | comprehensibility. Including example outputs, too, like
               | Darius Bacon's "halp", or R-Markdown, Jupyter, doctest,
               | and org-mode can do. Higher-level languages than Pascal
               | would probably be an improvement. Maybe building up the
               | program as a series of versions, like Kartik Agaram's
               | idea of "layers"; I did something similar but less
               | complete in my literate-programming system "handaxeweb".
               | We can do decent typography on a computer monitor now, so
               | in the worst case, we can attempt to leverage
               | interactivity, as so-called "explorable explanations" do.
               | 
               | Both programs and journal papers are mostly written the
               | way they are because of tradition, not because it's the
               | best way we've found to write them. Two examples:
               | 
               | Nobody would intentionally design a programming language
               | to make it impossible to include the kind of diagrams
               | that illustrate this article, but most programming
               | languages are in fact designed in such a way.
               | 
               | Buckminster-Fuller-style "ventilated prose" turns out to
               | be more readable than traditional paragraph formatting in
               | psychology experiments, which is why we use it
               | universally in our programs. But you can't publish a
               | paper formatted that way unless you call it a poem.
        
         | johnisgood wrote:
         | My minor nitpick regarding the code is the choice of variable
         | names, but other than that, it is indeed easy to read and it is
         | how I implement data structures and whatnot.
         | 
         | I like that there is a more or less exhaustive blog post about
         | it which happens to look nice as well.
        
         | cb321 wrote:
         | Kernighan & Ritchie's "The C Programming Language" also has an
         | example implementation of malloc() & free() { for any-size
          | requests, as @kragen notes in sibling } in about 3 pages
         | (though, sure, it is far from the most efficient
         | implementation). Since it's the last section of the main text,
         | I'd guess that most readers don't get that far (even though
         | it's a short book).
        
           | IgorPartola wrote:
           | Even if you don't program in C, I think K&R should be
           | required reading for anyone who writes software for a living.
           | You can finish it in an afternoon, or two days if you
           | actually write and run the code outlined in the book.
           | 
           | My bucket list includes an item to write a novel efficient
           | implementation of malloc/free. I have taken several stabs at
           | this in the past but have never been happy with it. I have no
           | real need of a custom allocator but it is something that I
           | think would be very cool to have under my belt. This was
           | inspired by K&R.
        
       | exDM69 wrote:
       | Here's another interesting O(1) memory allocator but with
       | arbitrary sized allocations and low fragmentation. Negative side
       | is relatively high memory overhead (a few dozen bytes per
       | allocation).
       | 
       | This kind of allocator is often used to suballocate GPU memory in
       | game and graphics applications.
       | 
       | I'm using a variant of this algorithm with added support for
       | shrinkable and aligned allocations and flexible bin sizing.
       | 
       | You can also extend this idea to two dimensions to create texture
       | atlases, which is possible in O(1) for power of two sized
       | allocations.
       | 
       | Original: https://github.com/sebbbi/OffsetAllocator Rust port:
       | https://crates.io/crates/offset-allocator
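        | 
        | (For readers wondering how arbitrary sizes can be O(1): the
        | usual trick, which OffsetAllocator's float-like bin
        | distribution builds on, is to map a size to a small bin index
        | with an exponent/mantissa split, then keep a free list per bin
        | plus a bitmask of non-empty bins. A rough sketch of just the
        | bin mapping, not sebbbi's actual code:)

```c
#include <assert.h>
#include <stdint.h>

#define MANTISSA_BITS 3  /* 8 sub-bins per power of two; real code tunes this */

/* Map an allocation size to a bin index in O(1). Sizes within a bin
 * differ by at most 1/8th, which bounds internal fragmentation. */
static uint32_t size_to_bin(uint32_t size) {
    if (size <= (1u << MANTISSA_BITS))
        return size - 1;                       /* tiny sizes map exactly */
    uint32_t exp = 31 - (uint32_t)__builtin_clz(size);      /* floor(log2(size)) */
    uint32_t mant = (size >> (exp - MANTISSA_BITS)) & ((1u << MANTISSA_BITS) - 1);
    return ((exp - MANTISSA_BITS) << MANTISSA_BITS) + mant
         + ((1u << MANTISSA_BITS) - 1);        /* keep the indices contiguous */
}
```

        | An allocation rounds up to the next bin boundary, and a
        | find-first-set over the non-empty-bin mask (one ctz
        | instruction) then locates a big-enough free block in constant
        | time.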
        
       | kragen wrote:
       | This is a very nice article! The diagrams in particular are very
       | clear. (They seem to have been done in draw.io:
       | https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al...
       | which seems to be a pretty decent Apache-2.0 free software
       | licensed diagram editor: https://github.com/jgraph/drawio/)
       | 
       | I think it would be improved further if it began with the calling
       | interface callers use to invoke the allocator, because it's
       | easier to understand the implementation of some service such as
       | pool allocation when you begin by knowing what service, exactly,
       | is required. I began with the assumption that the author was
       | describing an arena allocator under another name, rather than a
       | fixed-size block allocator, both because the APR's influential
       | arena allocator is called "apr_pool_t" and because one of the
       | main headings in the table of contents (a greatly appreciated
       | feature, like all the typography of the article) was "Closing the
       | pool". It took me quite a while to figure out that I was
       | mistaken. It's plausible that I only had this problem because I
       | am an unusually stupid, ignorant, or pigheaded person (some would
       | say all three), but, if not, many other people will have the same
       | difficulty I had of taking several minutes to figure out what the
       | topic of the article is. (I could have saved myself by clicking
       | either of the first two links in the Introduction, but I didn't
       | because I thought I already knew what it was.)
       | 
       | The current top-voted comment on this thread is by someone who
       | had the same reading comprehension problem I did, but never
        | realized their error:
       | https://news.ycombinator.com/item?id=42643951 and nobody had
        | pointed out their mistake until I did just now. So I think we can
       | say with some confidence that many HN readers will have the same
       | difficulty.
       | 
       | I think that the article would also be improved by a few other
       | small additions:
       | 
       | - It says, "The pool allocator, however, is much faster than
       | malloc", but doesn't provide any benchmarks or even an order of
       | magnitude.
       | 
       | - Probably when one benchmarks the implementation given, one will
       | find that it is only slightly faster than jemalloc or even
       | gnumalloc, because they both dispatch small allocations to a
       | fixed-block allocator similar to this one, and the dispatch isn't
       | very expensive. This can be fixed by inlining the allocation fast
       | path and the deallocation function, which will then compile down
       | to a few instructions each. A simple demonstration of how to do
       | this is in my arena allocator kmregion in
       | http://canonical.org/~kragen/sw/dev3/kmregion.h and
       | http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The
       | current implementation of 8dcc libpool is guaranteed to be unable
       | to do this unless you're using link-time optimization or a so-
       | called "unity" or "amalgamation" build process.
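        | 
        | A sketch of what I mean by inlining the fast path (hypothetical
        | names, and a stubbed-out slow path so it stands alone; see
        | kmregion.h for a real version): put the common case in the
        | header as `static inline`, leaving only the refill path behind
        | a function call.

```c
#include <assert.h>
#include <stddef.h>

struct pool { void *free_list; /* ... other fields ... */ };

/* The rare refill/out-of-memory path lives out of line in pool.c.
   Stubbed here so the sketch is self-contained. */
void *pool_alloc_slow(struct pool *p) { (void)p; return NULL; }

/* In the header: the hot path compiles down to roughly a load, a
   test, and two moves at each call site -- no call overhead. */
static inline void *pool_alloc(struct pool *p) {
    void *blk = p->free_list;
    if (blk == NULL)
        return pool_alloc_slow(p);
    p->free_list = *(void **)blk;  /* pop the freelist head */
    return blk;
}
```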
       | 
       | - It's worth distinguishing between average-case performance (in
       | which this allocator is _slightly_ better than malloc) and worst-
       | case performance (in which case malloc, generally speaking, isn
       | 't even in the running).
       | 
       | - It may be worth mentioning that often such pool allocators are
       | used in conjunction with initializing the objects in the pool to
       | a valid state when the pool is created; that way you don't have
       | to reinitialize objects to a valid state every time you
       | deallocate and reallocate them, which is usually more work than
       | malloc does to find the right free list. This does require not
       | overwriting the user data with your freelist pointer, so it's a
       | space/time tradeoff.
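        | 
        | (A sketch of that tradeoff, with invented names: keep the free
        | list as a separate index stack, so a freed object's contents,
        | including its once-initialized state, are never overwritten.)

```c
#include <assert.h>
#include <stdlib.h>

struct obj { int refcount; char buf[60]; };  /* example pooled object */

struct init_pool {
    struct obj *objs;
    size_t *free_stack;  /* indices of free objects: the extra space we pay */
    size_t free_top, cap;
};

static struct init_pool *ipool_new(size_t cap) {
    struct init_pool *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->objs = calloc(cap, sizeof *p->objs);
    p->free_stack = malloc(cap * sizeof *p->free_stack);
    if (!p->objs || !p->free_stack) {
        free(p->objs); free(p->free_stack); free(p);
        return NULL;
    }
    for (size_t i = 0; i < cap; i++) {
        p->objs[i].refcount = 0;        /* initialize every object ONCE, up front */
        p->free_stack[i] = cap - 1 - i;
    }
    p->free_top = cap;
    p->cap = cap;
    return p;
}

static struct obj *ipool_alloc(struct init_pool *p) {
    if (p->free_top == 0) return NULL;
    return &p->objs[p->free_stack[--p->free_top]];  /* already valid, no re-init */
}

static void ipool_free(struct init_pool *p, struct obj *o) {
    p->free_stack[p->free_top++] = (size_t)(o - p->objs);  /* o's data untouched */
}
```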
       | 
       | - Maybe you should mention alignment, especially now that even on
       | amd64 GCC has taken to using new instructions that require
       | alignment.
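        | 
        | (E.g., one common way to get alignment right, sketched with an
        | assumed 64-byte block size: make each block a union with
        | max_align_t and let _Static_assert enforce the invariant.)

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 64  /* assumed block size for the sketch */

union block {
    unsigned char bytes[BLOCK_SIZE];
    max_align_t align;       /* forces alignof(union block) >= alignof(max_align_t) */
    union block *next_free;  /* the freelist pointer also fits and stays aligned */
};

_Static_assert(_Alignof(union block) >= _Alignof(max_align_t),
               "pool blocks must satisfy the strictest fundamental alignment");
_Static_assert(sizeof(union block) % _Alignof(union block) == 0,
               "an array of blocks must keep every element aligned");
```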
       | 
       | Even as it is, though, it's highly educational, visually
       | pleasant, and very understandable (once I got over my initial
       | misconception of having a totally wrong idea of what the article
       | was about).
        
         | 8dcc wrote:
         | > The diagrams in particular are very clear. (They seem to have
         | been done in draw.io:
         | https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al...
         | which seems to be a pretty decent Apache-2.0 free software
         | licensed diagram editor: https://github.com/jgraph/drawio/)
         | 
         | Yes, this is true. Furthermore, the diagram information for
         | draw.io is included in the SVG images themselves, so one could
         | edit them there.
         | 
         | > It says, "The pool allocator, however, is much faster than
         | malloc", but doesn't provide any benchmarks or even an order of
         | magnitude.
         | 
         | You are right, I was referring to "malloc-like" allocators
         | (i.e. for variable-size blocks). It would be a good idea to
         | benchmark them.
        
           | kragen wrote:
           | In general when you're working on an optimization (and this
           | is an optimization) benchmarking it is essential. About a
           | third of the optimizations I try end up making my code slower
           | instead of faster, often because of performance
           | considerations I hadn't thought of. You can't just say "it's
           | much faster" without measuring.
        
             | 8dcc wrote:
             | Yes, you are right. Do you have any suggestions on how I
             | should do the benchmarking? I am not sure how this is
             | usually done, but if the output is some kind of graph, I
             | would also like to know how these graphs are usually
             | produced.
        
               | kragen wrote:
               | I'm no expert on this, so probably someone else can give
               | you better advice, but I'm happy to share what I do know.
               | Here are the most important points.
               | 
               | Benchmarking is a full-fledged scientific experiment. You
               | have a hypothesis that a particular change that you can
               | make, such as using a different allocator, will have a
               | particular effect, such as making your program run in
               | less time. So you make _just that one single_ change and
               | measure the results, while carefully controlling all
               | other conditions that could _also_ affect your program 's
               | run time. You have to do the test more than once in order
               | to find out whether you've succeeded in controlling them.
               | If you do a good enough job controlling the conditions of
               | the experiment, it becomes _reproducible_ and therefore
               | falsifiable. Programming usually isn 't science, but
               | benchmarking is. A great thing about it is that each run
               | of the experiment can take a fraction of a second, so you
               | can do the whole scientific method from beginning to end
               | in a few minutes and get a reproducible, falsifiable
               | result.
               | 
               | Benchmarking can get arbitrarily sophisticated, but the
               | most important thing is to dodge the bullets that can
               | turn benchmarks into nonsense, rather than produce very
               | sophisticated graphs or get very high precision.
               | 
               | The simplest thing on Linux or macOS is to compile a
               | command-line executable you can run in a terminal and run
               | time ./testprogram
               | 
               | at least three times to get an idea of how long the
               | wallclock time is and how variable it is. (The user and
               | system times shown are not very reliable anymore.)
               | Presumably there is also some way to do the same thing on
               | Microsoft Windows. If you're on a laptop, check to see if
               | it makes a difference if you're plugged in. Also, check
                | whether the result from
                |          time ./testprogram; time ./testprogram
               | 
               | is consistent; sometimes lags in CPU speed scaling can
               | produce unpredictable results.
               | 
               | Linux is not very consistent, so you should expect
               | variations in the range of 1-10% from run to run when
               | your program takes on the order of a second to run.
               | 
               | If you get consistently shorter wallclock times after
               | recompiling the program with a different allocator and
               | the same compiler options (probably make sure
               | optimization isn't completely turned off), you can
               | probably attribute that difference to the different
               | allocator. It's a good idea to switch back to the
               | original allocator and verify that you can reproduce your
               | original results to make sure you didn't accidentally
               | change something else at the same time (maybe your laptop
               | went into low-power mode because the battery is getting
               | low, or maybe you changed the number of iterations in a
               | loop and forgot you changed it, or maybe Firefox loaded a
               | YouTube video in the background and is making the machine
               | swap, that kind of thing).
               | 
               | It's useful to ensure that you aren't swapping and that
               | your CPU load other than the benchmark is minimal. `top`
               | (or, better, `htop`) and `vmstat 5` (or, better, `dstat
               | 5`) can be helpful here. Unless your program is heavily
               | multithreaded, the CPU load is not as crucial as it used
               | to be now that we all have multicore processors.
               | 
               | It's helpful to check the particular version of the
               | program you're benchmarking into Git, including the
               | Makefile or whatever specifies the compiler options. That
               | way when you write down your results you can refer to the
               | specific Git commit hash. Also, record your compiler
               | version, your operating system version, and your CPU.
               | Ideally someone who isn't you should be able to read your
               | notes on the benchmark, run it again, and get the same
               | results, if they have access to the same hardware.
               | 
               | The actual benchmark program itself can be as simple as
               | allocating and deallocating in a loop, ideally inside
               | another loop to run the whole benchmark multiple times;
               | but on processors with cache (which includes every PC
               | since the early 90s) the working set size can be very
               | important, and you may get significantly different
               | results for allocating 20 kilobytes 64 bytes at a time
               | and 20 gigabytes 64 bytes at a time. If you pass in the
               | number of iterations for one of the loops on the command
               | line, instead of hardcoding it in the program code, it's
               | easy to start with a small iteration count and work your
               | way up a factor of 10 at a time until the program takes a
               | few seconds to run. In this case, also be sure to record
               | the specific command line you were measuring the
               | performance of in your notes; knowing that a program took
               | 2.205-2.216 seconds to run is of no value if you don't
               | know if it was running a thousand iterations or a
               | million.
               | 
               | It's a good idea to actually compute something that you
               | output using the memory you're allocating and
               | deallocating, just in case GCC's optimizer decides that
               | your benchmark loop has no effect and therefore can be
               | safely removed from your program. (In
               | http://canonical.org/~kragen/sw/dev3/kmregion_example.c I
               | computed the sum of the data value stored in each node of
               | the 24th of all the 5000-node linked lists I allocated
               | and deallocated. That's because, when I read the
               | disassembly of my test code with `objdump -d
               | kmregion_example`, I found out that GCC had removed most
               | of the code inside the test loop because it could prove
               | that nobody ever read the struct fields I was
               | initializing. Reading disassembly is neither necessary
               | nor sufficient for dodging all bullets that invalidate
               | benchmarks.)
               | 
               | Varying the number of iterations in both the innermost
               | and outermost loop should produce roughly proportional
               | increases and decreases in wallclock time. (A helpful
               | technique is to subtract the time for 1000 iterations
               | from the time for 1001000 iterations to get the time for
               | the last million iterations.) If it doesn't, you're not
               | measuring what you think you're measuring. For example,
               | especially with JIT compilers, you may be measuring
               | startup overhead. That bit me this week with
               | http://canonical.org/~kragen/sw/dev3/hadamardsin.js,
               | which I erroneously reported was slower to run than the
               | LuaJIT version. It turns out that it's slightly faster to
               | _run_ , just much slower to _compile_.
               | 
               | Running the benchmark on more than one computer is
               | helpful because sometimes changes that speed things up on
               | one system slow them down on another. I remember a
               | recursive backtracking search program for logic circuit
               | optimization I wrote in C long ago which used longjmp()
               | to terminate the search when it found a solution; it ran
               | reasonably fast on Linux but very slowly on my friend's
               | Macintosh because MacOS implemented setjmp() as
               | sigsetjmp(), transforming it from a very cheap operation
               | into a very expensive one.
               | 
               | That's all you need to know to reliably do optimizations.
               | In fact, that's probably more than you need to know.
               | Everything below this point is extra.
               | 
               | -- *** --
               | 
               | Allocators are especially tricky to benchmark because
               | often, by changing where data is stored in memory, they
               | profoundly affect the speed of the rest of your program,
               | in ways that vary from program to program. More memory
               | fragmentation, for example, can increase how many TLB
               | misses your program runs into. I don't know what to do
               | about that except try to benchmark some real programs as
               | well as simple nested loops.
               | 
               | In cases where you're trying to measure an effect that's
               | roughly the size of your measurement precision,
               | statistical tests can be useful.
               | https://github.com/c-blake/bu/blob/main/doc/edplot.md
               | linked by cb321 above discusses using the 2-sample
               | Kolmogorov-Smirnov test, which is not nearly as
               | terrifying as it sounds. But most effects will be either
               | much larger than your measurement error, in which case
               | the result is obvious enough that you don't need the
               | statistics, or much smaller, in which case all they'll
               | tell you is that you can't be sure about the direction of
               | the effect, much less its magnitude. In medicine, where
               | your experiments take years and people die in them, using
               | sophisticated statistics to make the most of your
               | precious data is very important. In performance
               | benchmarking, where your experiments take milliseconds,
               | it's usually better to improve your measurement precision
               | instead. Sometimes just running the same benchmark more
               | times is enough, but there are more powerful techniques
               | available.
               | 
               | A way to get higher precision than `time` can give you is
               | to run the program under `valgrind` (on Linux). Even raw
               | valgrind will tell you exactly how many instructions it
               | ran, and in most cases that number is the same from run
               | to run, so instead of a measurement with +-3% precision
               | like `time` usually gives you when you've controlled
               | everything properly, you get a measurement typically with
               | +-0.00000001% precision. This is fantastic, far beyond
               | most things you can do in a physics, electronics, or
               | chemistry lab. It means that even with just two runs you
               | can tell if your optimization affected that number by
               | even 0.0000001%, or, more interestingly, 0.1%. And it
               | isn't affected by things like which CPU you run it on or
               | whether your laptop is unplugged, and it only slows the
               | program down about 10x.
               | 
               | Unfortunately, it measures the wrong thing, because a
               | program that runs more instructions can still be faster.
               | `valgrind --tool=cachegrind` tries to simulate your CPU's
               | cache hierarchy to give you a better idea of how long the
               | program will actually take; the `kcachegrind` GUI can
               | give you a "cycle estimation" number based on the numbers
               | it spits out. But it can still be unrelated to how fast
               | the program actually runs for a variety of reasons;
               | system calls are the biggest one. `kcachegrind` is also a
               | "profiler": it shows you how much time (simulated
               | instructions, simulated cache misses, cycle estimation,
               | whatever) is used by each subroutine in your program.
               | 
               | You might wonder why an optimization that speeds your
               | program up by 0.1% matters. The answer is that if you do
               | 106 such optimizations you have sped your program up by
               | 10%. SQLite has improved its performance by more than a
               | factor of 3 over a 10-year period by using this approach:
               | https://www.sqlite.org/cpu.html
               | 
               | There's a more basic profiler called gprof built into
               | GCC; building your program with `-pg` will add profiling
               | instrumentation in to it, so that it writes profiling
               | information into a file called `gmon.out`. After running
               | a program built with gprof, you can run `gprof
               | testprogram` to analyze `gmon.out`; it will tell you how
               | much time each subroutine took (including and excluding
               | its transitive callees) and how many times it was called.
               | 
               | Linux also has a built-in profiler called `perf_events`
               | https://perfwiki.github.io/main/ which uses hardware
               | performance counters to get numbers like cachegrind's,
               | but using the real hardware instead of simulation, and
               | including a lot more detail; it can also profile the time
               | your program spends in system calls. I've only used it in
               | very basic ways, so I don't know very much about it.
               | Intel also has a proprietary thing called VTune which can
               | do things perf can't and runs on OSes that aren't Linux,
               | but doesn't support AMD's processors.
               | 
               | -- *** --
               | 
               | I hope this is of some value to you! I'd be happy to
               | answer any other questions. Hopefully correctly.
        
               | 8dcc wrote:
               | Wow. This is probably the best comment I have ever seen
               | on HN. Thank you so much for the information, I will add
               | benchmarking to my 'libpool' project and mention it in
               | the article. I can't think of any questions right now,
               | but I will let you know if anything comes up. Thank you
               | again. :)
        
               | kragen wrote:
               | Aww, thanks! *blush*
        
               | 8dcc wrote:
               | Actually, I have one question. You said:
               | 
               | > The actual benchmark program itself can be as simple as
               | allocating and deallocating in a loop [...]
               | 
               | What do you mean, exactly? Deallocating immediately after
               | allocating? I feel like there would be a big difference
               | between that and allocating multiple blocks before
               | freeing them.
        
               | cb321 wrote:
               | You are right - there will be a big difference for these
               | two different workloads. As I am 100% sure @kragen
               | already knows, this is arguably the first, smallest step
               | in thinking for you to realize just how many possible
               | workloads you might want to model. :-) And to realize why
               | senior engineers will say pithy things like "the best
               | benchmark is your actual application".
               | 
               | While that is true (and I've said it myself), one reason
               | I like the design of a little "workload interpreter
               | shell" as in
               | https://github.com/c-blake/bst/blob/main/test/bst-
               | interact.c (or maybe better https://github.com/c-blake/ad
               | ix/blob/master/tests/btshell.ni... which has a
               | preprocessor) is that you can use them to run "isolated
               | tests" as a kind of "half step" between a pure hot loop
               | dumb thing and a full-on application with all its
               | attendant noise and self-interaction/self-disturbance.
               | 
               | So, for example, in this allocation problem setting you
               | could write a similar 30-line "shell/interpreter" that
               | interprets a binary stream of "commands" or allocation
                | requests. The origin of said binary stream could be a
               | recording/log from some actual app you have doing
               | something actually useful. Then you could apply that one
               | recording in the "shell" in different configurations as
               | @kragen mentions to study the allocator performance (both
               | space & speed) in isolation. At that point you could
               | start to say pretty meaningful things from replaying logs
               | many times about this or that tradeoff on this|that
               | workload.
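                | 
                | Concretely, the replay core of such a "shell" can be
                | tiny. A hypothetical sketch (my own command encoding,
                | not bst-interact.c's): one int per command, n > 0
                | meaning "allocate into slot n", n < 0 meaning "free
                | slot -n", with the trace mmap'ed or read from a
                | recorded log in real use.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_SLOTS 1024

/* Replay a recorded alloc/free trace against the allocator under
 * test; returns the number of commands executed. */
static long replay(const int *cmds, long ncmds) {
    void *slots[MAX_SLOTS] = {0};
    for (long i = 0; i < ncmds; i++) {
        int c = cmds[i];
        if (c > 0 && c < MAX_SLOTS && slots[c] == NULL)
            slots[c] = malloc(64);       /* or the pool allocator under test */
        else if (c < 0 && -c < MAX_SLOTS && slots[-c] != NULL) {
            free(slots[-c]);
            slots[-c] = NULL;
        }
    }
    for (int s = 0; s < MAX_SLOTS; s++)  /* release whatever the trace leaked */
        free(slots[s]);
    return ncmds;
}
```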
               | 
               | EDIT: Of course, at _that_ point you start to get into
               | debates about  "how representative" different workloads
               | are, but at least you could potentially share your log
               | with someone who had a different computer and have them
               | run benchmarks, too, and at least the workload is not
               | synthetic but relates to some actual possible program.
               | Moving from synthetic benchmarks to log-based ones was a
               | big thing in OS research on filesystems in the 1970s &
               | 80s and it helped the "generalizability of results" (I
               | think, anyway..I'm sure many still do only synthetic
               | benchmarking due to its convenience).
               | 
               | EDIT2: I just thought of a cute name for the log replay
               | half-step between synthetic microbenchmarks and a full
               | application - "The millibenchmark". ;-) Maybe someone
               | else has thought of this before, though.
        
               | kragen wrote:
               | The "interpreter shell" is an interesting idea!
               | 
               | Recording traces of allocation requests from actual
               | applications for allocator-benchmarking purposes
               | revolutionized allocator performance research in the late
               | 90s, if I recall correctly, in addition to the earlier
               | filesystem research you mentioned. Benchmarks based on
               | those traces were the bedrock of Zorn's case that
               | generational GCs were faster _in practice_ than malloc().
               | 
               | But I think (though possibly this is motivated reasoning)
               | that benchmarking _some_ workload is good enough for many
               | optimizations. A fixed-size-block allocator is already
               | pretty restricted in what workloads you can run on it,
               | after all, and its performance characteristics are a lot
               | simpler than something like first-fit. Showing that an
               | optimization makes _some_ workload faster is infinitely
               | better than not knowing if it makes _any_ workload
               | faster. Showing that it generalizes across many workloads
               | is better, of course, but it 's not nearly the same step
               | in utility.
        
               | cb321 wrote:
                | Well, you can use it for everything. I've used it for
                | in-memory hash tables, trees, etc. To be clear, for those
               | who may not understand, the "interpretation" in these
               | "millibenchmarks" for something as simple as an allocator
               | is not like Python or Bash or something slow. It could
               | literally be "mmap a binary file of integers and either
                | alloc/free the binary numbers". So, yeah, there might be
               | some CPU pipeline overhead in unpredictable branches and
               | you might still want to try to measure that, but beyond
               | that there is basically no work and any real workload
               | might have one-ish hard to predict branch leading to
               | allocator operations. So, even unadjusted for overhead
               | it's not so bad. And it keeps your measurements fast so
               | you can get statistically many of them even if there is
               | an "inner min loop" to try to isolate the best case or
               | something. (Well, at least if what you are measuring is
               | fast.) The downside/tradeoff is just lack of fidelity to
               | interaction effects in the full application like resource
               | competition/branch predictor disruption/etc.
               | 
               | You could probably set up a
               | https://github.com/c-blake/nio file for it (or other) if
               | you wanted to be systematic. A problem with the "text-
               | newline" part of the "unix philosophy" is that it incurs
               | a lot of unneeded binary->ascii->binary cycles that's
               | actually more programming work as well as more CPU work,
               | and it's really fairly orthogonal to the rest of the Unix
               | ideas.
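The "mmap a binary file of commands" replay loop described above could be sketched roughly as follows. This is an illustrative harness, not code from the thread: `pool_alloc`/`pool_free` are stand-ins (stubbed here with malloc/free so the sketch runs standalone), and a real harness would mmap the recorded trace file rather than use an in-memory array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the allocator under test; in a real benchmark these
 * would be the pool allocator's own alloc/free entry points. */
static size_t n_allocs, n_frees;
static void *pool_alloc(void) { n_allocs++; return malloc(16); }
static void pool_free(void *p) { n_frees++; free(p); }

/* Replay a binary stream of int "commands": v >= 0 means "allocate and
 * store the result in slot v"; v < 0 means "free slot -v - 1". The
 * interpretation overhead is essentially one branch per operation. */
static void replay(const int *cmds, size_t n, void **slots)
{
    for (size_t i = 0; i < n; i++) {
        int v = cmds[i];
        if (v >= 0)
            slots[v] = pool_alloc();
        else
            pool_free(slots[-v - 1]);
    }
}
```

Timing such a loop over a recorded trace, repeatedly and under different allocator configurations, is the "millibenchmark" idea: one shared workload file, near-zero interpreter overhead.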
               | 
               | EDIT reply to parent EDIT
               | 
               | > benchmarking some workload is good enough
               | 
               | Fair enough, except that "good enough" is sort of "famous
               | last words" if anyone is reviewing your work. :-) :-)
               | Some will complain you only did Intel not AMD or not ARM
               | or not xyz or Apple Silicon or whatever other zillion
               | variables. Something is always better than nothing,
               | though. It should be thought of more like a Bayesian
               | thing - "no surprise for untested situations" or maybe
               | "confidence in proportion to evidence/experience".
               | 
               | FWIW, I thought your big advice to 8dcc was pretty
               | decent.
        
               | kragen wrote:
               | Thanks! I agree with all of that except for the last
               | statement, which I am not competent to judge.
        
               | kragen wrote:
               | You are sure right about that! The example I linked
               | allocates 5000 blocks and makes a linked list out of
               | them, then deallocates them.
        
             | cb321 wrote:
             | Modern CPUs (with all their buffers & caches & predictor
             | warm-ups & spin-ups) can make benchmarking really tricky
             | unless you have a very specific workload in mind and even
             | then it requires care (like looking at both the edge of[1]
             | and full timing distributions[2] for best-case,
             | average/median/typical case, worst case analysis, etc.).
             | Briefly, the number of ways to get (un)lucky has
             | proliferated over the decades, as has the variety of
             | deployment platforms.
             | 
             | Given the depth of that rabbit hole, as writing advice, I
             | might suggest instead motivating via "more space efficient"
             | and refer to exactly fitting fixed-size objects where more
              | general purpose allocators will (usually)* waste space to
             | the next power of two. Measuring less dynamic space is also
             | easier (very dynamic space like %cache used can be tricky,
             | though). Controlling scope and pedagogy in general is also
             | tricky, but I think this rhetorical advice is common
             | practice and might also double as a way to address your
             | initial misread of the topic. EDIT: he can always do
             | refinements / more analysis in follow up posts/something if
             | this library has any legs for him.
             | 
             | [1] https://github.com/c-blake/bu/blob/main/doc/tim.md
             | 
             | [2] https://github.com/c-blake/bu/blob/main/doc/edplot.md
             | 
              | * Sure, fancy any-size allocators can tack on a histogram
             | of maybe somewhat rounded lg sizes (to economize histo
             | space) and spawn fixed-size object sub-arenas if there are
             | enough of them, say >= enough to fill a VMem page or 2, but
             | this is rare in my experience and also complicates the
             | dispatch to subarenas that you mention above to slightly
             | fancier search. This kind of topic _rapidly_ spirals into
             | allocator-complete discussion and even systems-complete -
             | e.g. a filesystem is largely an allocator with good old MS-
             | DOS FAT being close to a  "file is a pool of disk blocks"
             | model.
        
               | kragen wrote:
               | This is an excellent comment. Thank you.
        
       | 2rsf wrote:
        | This brings back old memories: I implemented something similar on
        | a Motorola 68360 ages ago. Since the size of the buffer I used
        | was not huge, I skipped the pointers and simply enumerated each
        | chunk; it was remarkably efficient and stable.
        
       | codr7 wrote:
       | I usually default to slab allocators if I have to write my own:
       | 
       | https://github.com/codr7/libcbs#slab-allocation
        
       | drysine wrote:
       | Also "Arena allocator tips and tricks (nullprogram.com)" [0]
       | 
       | [0] https://news.ycombinator.com/item?id=37670740
        
       | sylware wrote:
        | It all depends on the load profile using the allocator. You never
        | know; that said, in theory you cannot beat a semantically
        | specialized allocator for a specific load... in other words, the
        | right way(TM). This means applications should bundle their own
        | allocators for the various load types they have. The "generic"
        | allocator is sort of a heresy for the lazy, the short-termists, or
        | those in a hurry. Don't worry, I still use such a generic
        | allocator sometimes, but often I mmap the memory myself and deal
        | with it directly.
        
         | astrange wrote:
         | The system allocator is the one that comes with all the
         | system's memory debugging tools and security (or lack of it),
         | so you'd better not mind losing those if you want to ditch it.
        
       | voctor wrote:
       | There is a small series of posts on the BitSquid blog about
       | memory allocation which is worth reading!
       | http://bitsquid.blogspot.com/2015/08/allocation-adventures-3...
        
       | 8dcc wrote:
       | Hello, I am the author. Thank you all for the instructive
       | comments, I made some changes to the article since I first posted
       | it:
       | 
       | - Added a link to this HN post.
       | 
       | - Renamed some variables and types in the code for readability.
       | 
       | - Mention (in a footnote) that we could allocate the 'Pool'
       | structure and the chunk arrays with a single call, or even return
       | the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
       | 
       | - Mention (in a footnote) that we don't always need to store the
       | full address when building the linked list of free chunks.
       | Suggested by 'cb321'.
       | 
       | - Added valgrind support (also to my 'libpool' project).
       | Suggested by 'kazinator'.
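The single-call idea mentioned in the first footnote above could look roughly like this. The field names are illustrative, not the article's exact ones: the Pool header and the chunk array live in one malloc'd block (via a C99 flexible array member), so one free() releases both.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: header and chunk storage allocated together. The free-list
 * setup is elided; only the single-allocation layout is shown. */
typedef struct Pool {
    void *free_chunk;       /* head of the free-chunk list */
    unsigned char chunks[]; /* flexible array member: the chunk arena */
} Pool;

static Pool *pool_new(size_t nchunks, size_t chunk_sz)
{
    Pool *p = malloc(sizeof(Pool) + nchunks * chunk_sz);
    if (p)
        p->free_chunk = p->chunks; /* real code would link all chunks */
    return p;
}
```

The payoff is that `pool_close` becomes a single `free(pool)` with no separate chunk-array pointer to track.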
        
         | cb321 wrote:
         | FWIW, while it is true that a smaller pointer lets you have
         | smaller minimum fixed size objects in your pool, as your new
         | footnote suggests, the space concern I was mostly alluding to
         | in [1] is all the references in your _client code_ { such as
         | the BST code I linked to where each binary search tree (BST)
         | node has 2 pointers (or 3 if they have parent pointers to give
         | BST nodes  "doubly linked list-like autonomy") }.
         | 
         | This can really add up. For example, like 20 years ago, I did a
         | `print sizeof` in gdb for some g++ STL map node with a data
         | payload of like 4 bytes and it was some obscene size
         | approximately >= 1 order of magnitude bigger than necessary
          | (like 80 bytes instead of, say, 8 with 2-byte ptrs). While main
         | memory is cheap, CPU cache memory is not and for "linky"
         | structures, it often pays to be more cache-friendly. It's
         | pretty easy to imagine a 10X smaller thing fitting into an L2
         | CPU cache where the larger thing would not even fit in L3,
         | netting you a big latency boost (though the more root-ward
         | levels of the tree likely always stay in caches in a hot loop).
         | 
         | EDIT: Anyway, you don't have to believe me. Others did an x32
         | ABI for x86_64 mostly to have "merely 2X" smaller pointers:
         | https://en.wikipedia.org/wiki/X32_ABI
         | 
         | [1] https://news.ycombinator.com/item?id=42643410
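The small-pointer idea being discussed can be made concrete with a sketch like the following (illustrative names, not the article's libpool API): with at most 65536 chunks, the free list can link chunks by 16-bit index into the chunk array instead of by full 8-byte pointer, dropping the minimum chunk size to 2 bytes and shrinking any client-side references stored the same way.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NCHUNKS  1024
#define CHUNK_SZ 16           /* payload size; must hold a uint16_t */
#define NIL      UINT16_MAX   /* end-of-free-list sentinel */

static unsigned char chunks[NCHUNKS][CHUNK_SZ];
static uint16_t free_head;

static void pool_init(void)
{
    /* Build the free list by storing each chunk's successor *index*
     * (not pointer) in the chunk itself. */
    for (uint16_t i = 0; i < NCHUNKS; i++) {
        uint16_t next = (i + 1 < NCHUNKS) ? (uint16_t)(i + 1) : NIL;
        memcpy(chunks[i], &next, sizeof next);
    }
    free_head = 0;
}

static void *pool_alloc(void)
{
    if (free_head == NIL)
        return NULL;
    void *p = chunks[free_head];
    memcpy(&free_head, p, sizeof free_head); /* pop head of free list */
    return p;
}

static void pool_free(void *p)
{
    /* Recover the chunk's index from its address, then push it. */
    uint16_t idx = (uint16_t)((unsigned char (*)[CHUNK_SZ])p - chunks);
    memcpy(p, &free_head, sizeof free_head);
    free_head = idx;
}
```

Client structures (e.g. BST nodes) could store the same 2-byte indices in place of child pointers, which is where the cache-footprint savings described above mostly come from.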
        
         | veltas wrote:
         | Also the font isn't incredibly easy to read on Windows. I'm
         | assuming Georgia was selected? Which has been on Windows for
         | years but renders so badly it looks like the font color isn't
         | black, even though I think it is black? This isn't really your
         | fault but I would have stuck to Times New Roman for Windows'
         | sake.
        
         | tandr wrote:
         | As a last code snippet readability suggestion - maybe extract
         | "malloc" and "free" calls to your own simple wrappers (like
         | my_malloc and my_free), and do "valgrinding" inside of them?
         | You can also pass additional params to these functions to
         | specify type of allocation if needed, or make separate
         | functions.
        
       | jhallenworld wrote:
       | A sometimes useful thing is to treat the pool as a stack and
       | provide a call to free all recently allocated items up to a
       | certain index. So you make a bunch of deep subroutine calls that
       | use the pool allocator, and then on return free them all. It's
       | like a secondary automatic storage class.
       | 
       | Also sometimes useful to provide another call to promote an
       | automatic item to a more permanent one so that it is excluded
       | from being freed like this.
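The "pool as stack" idea above is essentially an arena with mark/release. A minimal sketch, with illustrative names: callers save a mark before a deep call tree and release back to it on return (the "promote" operation mentioned would additionally copy an item below the saved mark).

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_CAP 4096

static unsigned char arena[ARENA_CAP];
static size_t arena_top; /* index of the next free byte */

static void *arena_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;        /* keep 8-byte alignment */
    if (arena_top + n > ARENA_CAP)
        return NULL;
    void *p = &arena[arena_top];
    arena_top += n;
    return p;
}

static size_t arena_mark(void)        { return arena_top; }
static void arena_release(size_t mark) { arena_top = mark; }
```

Usage mirrors automatic storage: `size_t m = arena_mark();` before the calls, `arena_release(m);` after, freeing everything allocated in between in O(1).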
        
       | harrison_clarke wrote:
       | you can avoid resizing and moving the pools if you allocate
       | massive pools up front. most OSs let you overcommit memory, and
       | most programs using memory pools only have a handful of them
       | 
       | (on windows, you'd need to handle this with VirtualAlloc. osx and
       | linux let you overcommit with malloc, and handle it with a copy-
       | on-write page fault handler)
        
       | lelandbatey wrote:
       | It seems like almost all of the complexity of this allocator
       | comes from having to manage the fact that it's being implemented
       | _on top of_ the system allocator. I thought the whole point of
       | using a custom allocator was to avoid the system allocator for
       | exactly the problems they mention e.g. calling the system-
       | provided realloc() might move your stuff around, having to track
       | separate pool starts so you can call the system-provided free()
       | on them, etc.
       | 
       | Like, yeah you have to do that, but I thought the whole point of
       | writing a custom allocator was to not use the system allocator
       | beyond statically allocating a huge block of memory upfront and
       | then managing that yourself, is it not?
        
       | alecco wrote:
       | (related) LLFree
       | 
       | Paper: LLFree: Scalable and Optionally-Persistent Page-Frame
       | Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
       | 
       | Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
       | 
       | Implementation and benchmarks are well documented at the repos:
       | 
       | Rust repo https://github.com/luhsra/llfree-rs C repo
       | https://github.com/luhsra/llfree-c
        
       ___________________________________________________________________
       (page generated 2025-01-09 23:00 UTC)