[HN Gopher] Writing a simple pool allocator in C
___________________________________________________________________
Writing a simple pool allocator in C
Author : 8dcc
Score : 254 points
Date : 2025-01-05 23:07 UTC (3 days ago)
(HTM) web link (8dcc.github.io)
(TXT) w3m dump (8dcc.github.io)
| pajko wrote:
| ThreadX has pool allocators: https://github.com/eclipse-
| threadx/threadx/blob/master/commo...
| kazinator wrote:
| Everyone and his dog who has programmed C for a few decades has
| several as well. :)
| liontwist wrote:
| The pool struct is two pointers, why are you allocating it with
| malloc?
| o11c wrote:
| Some people are allergic to `main() { Foo foo; foo_init(&foo);
| }`
|
| Admittedly, sometimes for good reason - it locks in the ABI.
| MathMonkeyMan wrote:
| Is that why some structs have "char reserved[16]" data
| members? I think I saw this in NGINX, though that might have
| been to allow module compatibility between the open source
| offering and the paid version.
| o11c wrote:
| Yes, that slightly improves ABI flexibility, though the
| fact that callers can still access fields limits that.
|
| An alternative is to make the _only_ member `char
| opaque[16]` (IIRC some locale-related thing in glibc does
| this, also I think libpng went through a transition of some
| sort related to this?), but calculating the size can be
| difficult outside of trivial cases since you can't use
| sizeof/offsetof.
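A minimal sketch of the `char opaque[16]`-style trick being described (all names here are illustrative, not taken from glibc or libpng):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <string.h>

/* Public header: callers only see an opaque, fixed-size, aligned blob,
   so the real fields can change without breaking the ABI. */
typedef struct {
    alignas(max_align_t) unsigned char opaque[32]; /* sized with slack */
} Foo;

/* Private implementation struct, normally hidden in the .c file. */
typedef struct {
    int  refcount;
    char name[16];
} FooImpl;

/* Build-time check that the real struct still fits in the blob; this is
   the size calculation that gets difficult in non-trivial cases. */
_Static_assert(sizeof(FooImpl) <= sizeof(Foo), "Foo::opaque too small");

void foo_init(Foo *f) {
    FooImpl *impl = (FooImpl *)f->opaque;
    memset(impl, 0, sizeof *impl);
    impl->refcount = 1;
}
```

Callers can declare `Foo foo; foo_init(&foo);` on the stack without ever seeing `FooImpl`.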
| liontwist wrote:
| ABI? Better make it an anonymous struct too.
|
| It's a performance oriented custom memory allocator.
| kazinator wrote:
| Mainly just the size and alignment. If you make the structure
| oversized, and with strict alignment, it can be future
| proofed. Old binary clients will provide a structure big
| enough and aligned enough for the new version.
|
| The dynamic constructor API can allocate the structure
| exactly sized without the padding.
| unwind wrote:
| Next question: why are there two calls to `malloc()`, one for
| the Pool structure and one for the Chunks?
|
| It is trivial to co-allocate them which removes the risk of
| memory fragmentation and is just generally a good idea if
| you're chasing performance.
|
| The answer might be "for clarity/simplicity" which I guess is
| fine in an informative article, but it could at least be
| highlighted in the text which I didn't see.
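The co-allocation being suggested can be sketched with a C99 flexible array member; names are modeled loosely on the article's `Pool`/`Chunk`, not copied from it:

```c
#include <assert.h>
#include <stdlib.h>

typedef union Chunk {
    union Chunk *next;  /* valid only while the chunk is free */
} Chunk;

typedef struct Pool {
    Chunk *free_chunk;  /* head of the free list */
    Chunk  chunks[];    /* flexible array member: the chunks live in the
                           same malloc'd block as the header */
} Pool;

/* One malloc() for the header and every chunk; chunk_sz is assumed to
   be a multiple of sizeof(void *) so free-list pointers stay aligned. */
static Pool *pool_new(size_t chunk_sz, size_t nchunks) {
    Pool *pool = malloc(sizeof(Pool) + chunk_sz * nchunks);
    if (pool == NULL)
        return NULL;
    char *base = (char *)pool->chunks;
    for (size_t i = 0; i + 1 < nchunks; i++)
        ((Chunk *)(base + i * chunk_sz))->next =
            (Chunk *)(base + (i + 1) * chunk_sz);
    ((Chunk *)(base + (nchunks - 1) * chunk_sz))->next = NULL;
    pool->free_chunk = pool->chunks;
    return pool;
}
```

A single `free(pool)` then releases the header and all chunks together, which is part of why co-allocation also simplifies teardown.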
| 8dcc wrote:
| I will also take note of this, you are right.
| jdnxbdn wrote:
| It is certainly not trivial and would add little to the topic
| kevin_thibedeau wrote:
| He's targeting ANSI C, C89. Flexible array members are not
| supported officially until C99. 26 years later, it's time to
| stop clinging to outdated standards and crusty tooling. Even
| Linux had to read the room and adopt C11.
|
| A C11 implementation could go one step further and use
| _Alignas(max_align_t) to keep the pool array aligned with no
| manual effort. The double allocation does this implicitly.
| liontwist wrote:
| The pool needs to align every allocation anyway.
| lyu07282 wrote:
| Couldn't this also just syscall mmap directly? I mean malloc
| itself is a memory allocator it feels a bit strange to use it
| in an allocator implementation, but perhaps I'm missing
| something.
| keyle wrote:
| The parent article is quite interesting and slightly different
| being cpp [1]. There is also a post about writing a memory
| allocator [2].
|
| [1]: http://dmitrysoshnikov.com/compilers/writing-a-pool-
| allocato...
|
| [2]: http://dmitrysoshnikov.com/compilers/writing-a-memory-
| alloca...
| 10000truths wrote:
| I recommend against using linked lists for bookkeeping the free
| blocks. It seems to be the data structure that every malloc/free
| implementation reaches for, and I don't know why - the slowness
| of pointer-chasing makes it terrible for almost any real-world
| use case. A balanced tree would be a much better idea, given that
| all the essential operations would take O(log n) time instead of
| O(n). Even if one insists on a linear search, a bitset is much
| more cache friendly than pointer-chasing and it can trivially
| benefit from SIMD optimizations.
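For what it's worth, the bitset idea might look like this for a 64-chunk pool (GCC/Clang builtins assumed; indices would be scaled into chunk addresses by the caller):

```c
#include <assert.h>
#include <stdint.h>

/* Bit i set means chunk i is free.  find-first-set maps to a single
   instruction on most targets, and the whole mask fits in one register. */
typedef struct {
    uint64_t free_mask;
} BitPool;

static int bitpool_alloc(BitPool *p) {       /* returns chunk index, or -1 */
    if (p->free_mask == 0)
        return -1;
    int i = __builtin_ctzll(p->free_mask);   /* index of lowest set bit */
    p->free_mask &= p->free_mask - 1;        /* clear that bit */
    return i;
}

static void bitpool_free(BitPool *p, int i) {
    p->free_mask |= (uint64_t)1 << i;
}
```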
| ben-schaaf wrote:
| I don't think you understand how the allocator in the article
| works. Allocating and freeing are already O(1), creating and
| closing the allocator are necessarily O(n). There is no search
| being done here.
| o11c wrote:
| For the `Chunk` list, this isn't one of the cases where linked
| lists are harmful. Each use _only_ touches the top of the
| stack, never iterates. Also, linked lists are much easier to
| make thread-safe.
|
| For the `LinkedPtr` list, the bad case is only hit during the
| destruction of the pool, and then only if you allocate a lot of
| memory. And given the overhead of deallocation I'm not sure the
| array approach would measurably help.
|
| I don't see anywhere a binary tree search would be useful here,
| since there are no loops used for lookup (on allocation they're
| pushed in order, but when freed chunks are pushed in arbitrary
| order; this does mean no double-free protection).
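The head-only O(1) operations being described look roughly like this (a sketch in the spirit of the article's code, not a copy of it):

```c
#include <assert.h>
#include <stddef.h>

typedef union Chunk { union Chunk *next; } Chunk;
typedef struct Pool { Chunk *free_chunk; } Pool;

/* Both operations touch only the list head; no traversal ever happens. */
static void *pool_alloc(Pool *pool) {
    Chunk *chunk = pool->free_chunk;
    if (chunk != NULL)
        pool->free_chunk = chunk->next;  /* pop */
    return chunk;
}

static void pool_free(Pool *pool, void *ptr) {
    Chunk *chunk = ptr;
    chunk->next = pool->free_chunk;      /* push */
    pool->free_chunk = chunk;
}
```

Note `pool_free` pushes in whatever order chunks come back, which is the "arbitrary order" point above, and also why there is no double-free protection.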
| harrison_clarke wrote:
| why would you need to iterate on destruction? you free the
| whole pool at once, since you allocate it all at once
| elchananHaas wrote:
| The reason linked lists are used is that, for large enough
| allocations, there is no overhead: you use the space the
| application isn't using. In addition, if all allocations are
| the same size it is O(1); you just look at the head of the
| list.
|
| More sophisticated strategies bucket allocations by size, which
| has fixed overhead. You can also use balanced trees for more
| memory efficiency, but this is slower.
|
| For small allocations (8 bytes) that are too small to contain
| pointers, allocators will allocate a block and use bitsets.
| JJOmeo wrote:
| Pool allocators don't walk the list or search anything though.
| All interactions are only at the list head and O(1), as all
| free nodes are just that, free and equal.
| ceving wrote:
| Then it might be a misnomer. Calling it "stack" instead of
| "list" might be better.
| alextingle wrote:
| That is a fair comment.
| cb321 wrote:
| While the most precise naming might be "a singly linked
| list representation of a stack", "free list"
| (https://en.wikipedia.org/wiki/Free_list) is an ancient
| term of art for this problem.. so much so that wikipedia
| (at least the version today) even suggests "freelist" all
| one word as a spelling.
|
| The initial linking together of all free space (also
| cache friendly, btw) is often called "threading the free
| list", although this terminology is less standard than
| "free list" itself.
| MathMonkeyMan wrote:
| Which pool operation is made more costly by the linked list?
| Both allocate and free are constant time, and trivial too.
|
| The only thing that I can imagine being faster is a bump
| allocator, but that precludes free.
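A bump allocator, for contrast, is just a pointer increment; a minimal sketch of the trade-off mentioned (fast allocation, but no per-object free, only a whole-region reset):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char  *base;
    size_t used, cap;
} Bump;

static void *bump_alloc(Bump *b, size_t n) {
    n = (n + 15) & ~(size_t)15;          /* keep 16-byte alignment */
    if (b->cap - b->used < n)
        return NULL;
    void *p = b->base + b->used;
    b->used += n;                        /* the "bump" */
    return p;
}

/* The only form of free: everything at once. */
static void bump_reset(Bump *b) { b->used = 0; }
```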
| kazinator wrote:
| Suppose the implementation did happen to search through the free
| blocks. Suppose you put them into an array instead of a linked
| list. They can't actually be in the array, you see? The blocks
| are in the pool heap wherever they happen to be. So the array
| has to point to them. As you walk through the array you have to
| dereference the pointers. The only way it's better is that
| they are not dependent loads. But you now have this array to
| manage.
| ch33zer wrote:
| Some things that would be interesting to add to this:
|
| - thread safety (not too hard to add on top of your linked list)
| - variable sized allocations, ideally something more akin to
| malloc; could be done wastefully by rounding up to the nearest
| chunk size
| - (as mentioned in the article) variable chunk sizes
| - zeroing of memory before handing it back to users
| - double free detection, or even better full ASan support
| 8dcc wrote:
| I don't currently have much experience with thread safety, but
| it's something I should look into. Thank you.
| kazinator wrote:
| If you write your own allocator in C, do yourself a favor and use
| the valgrind API inside it. Its use can be conditional so you can
| "compile it out". Or should I say it has to be, so you can build
| the code in an environment where there's no valgrind API.
| gopalv wrote:
| +1 - this isn't even that hard: it's a single header include,
| plus annotating allocations/frees, with some redzones around the
| actual allocation to know when the user code is writing into
| the pool pointer areas.
|
| https://developers.redhat.com/articles/2022/03/23/use-valgri...
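A sketch of where the Memcheck mempool client requests would go in a pool like the article's. The macros compile to no-ops when `HAVE_VALGRIND` is undefined, so they cost nothing in release builds; a production integration also needs `VALGRIND_MAKE_MEM_DEFINED` around the allocator's own reads of free-chunk headers (see the linked Red Hat article):

```c
#include <assert.h>
#include <stddef.h>

#ifdef HAVE_VALGRIND
#include <valgrind/valgrind.h>
#else  /* compile the requests out entirely */
#define VALGRIND_CREATE_MEMPOOL(p, rz, zeroed) ((void)0)
#define VALGRIND_MEMPOOL_ALLOC(p, addr, size)  ((void)0)
#define VALGRIND_MEMPOOL_FREE(p, addr)         ((void)0)
#endif

enum { CHUNK_SZ = 64, NCHUNKS = 8 };

typedef union Chunk { union Chunk *next; char user[CHUNK_SZ]; } Chunk;
typedef struct Pool { Chunk *free_chunk; Chunk chunks[NCHUNKS]; } Pool;

static void pool_init(Pool *pool) {
    for (int i = 0; i < NCHUNKS - 1; i++)
        pool->chunks[i].next = &pool->chunks[i + 1];
    pool->chunks[NCHUNKS - 1].next = NULL;
    pool->free_chunk = pool->chunks;
    VALGRIND_CREATE_MEMPOOL(pool, 0, 0);  /* 0 redzone, not zeroed */
}

static void *pool_alloc(Pool *pool) {
    Chunk *c = pool->free_chunk;
    if (c == NULL) return NULL;
    pool->free_chunk = c->next;
    VALGRIND_MEMPOOL_ALLOC(pool, c, CHUNK_SZ); /* addressable to the user */
    return c;
}

static void pool_free(Pool *pool, void *ptr) {
    Chunk *c = ptr;
    c->next = pool->free_chunk;    /* write next while still addressable */
    pool->free_chunk = c;
    VALGRIND_MEMPOOL_FREE(pool, c); /* user accesses after this are errors */
}
```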
| hooli_gan wrote:
| https://en.wikipedia.org/wiki/Red_zone_(computing)
| dsp_person wrote:
| oof love on this page how scrolling hijacks the back button
| 8dcc wrote:
| I have never heard about this (using valgrind with a "non-
| standard allocator"), but it looks really interesting,
| especially for other projects I am working on. I will
| definitely look into it.
| mananaysiempre wrote:
| You can also integrate with AddressSanitizer to some extent,
| look into[1] ASAN_{UN,}POISON_MEMORY_REGION from
| <sanitizer/asan_interface.h> or the underlying intrinsics
| __asan_{un,}poison_memory_region.
|
| [1] https://github.com/google/sanitizers/wiki/AddressSanitize
| rMa...
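The ASan side can be sketched the same way: poison chunks while they sit on the free list, unpoison on handout. The fallback macros make this a no-op without ASan, and unpoisoning before reading `next` keeps the allocator's own accesses legal:

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_feature            /* GCC spells it __SANITIZE_ADDRESS__ */
#define __has_feature(x) 0
#endif
#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
#include <sanitizer/asan_interface.h>
#else                            /* no-ops when ASan is off */
#define ASAN_POISON_MEMORY_REGION(addr, size)   ((void)0)
#define ASAN_UNPOISON_MEMORY_REGION(addr, size) ((void)0)
#endif

enum { CHUNK_SZ = 32, NCHUNKS = 4 };

typedef union Chunk { union Chunk *next; char user[CHUNK_SZ]; } Chunk;
typedef struct Pool { Chunk *free_chunk; Chunk chunks[NCHUNKS]; } Pool;

static void *pool_alloc(Pool *pool) {
    Chunk *c = pool->free_chunk;
    if (c == NULL) return NULL;
    ASAN_UNPOISON_MEMORY_REGION(c, CHUNK_SZ); /* before we read c->next */
    pool->free_chunk = c->next;
    return c;
}

static void pool_free(Pool *pool, void *ptr) {
    Chunk *c = ptr;
    c->next = pool->free_chunk;
    pool->free_chunk = c;
    ASAN_POISON_MEMORY_REGION(c, CHUNK_SZ);   /* user touches now trap */
}
```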
| kazinator wrote:
| I integrated the TXR Lisp garbage collector with valgrind.
| That came in handy on a number of occasions.
|
| It's not always the case that you want special behavior for
| the garbage collector to assist with debugging, but also
| this: if you want to use Valgrind to debug anything that does
| _not_ have to do with the garbage collector, you want all the
| false positives generated by it out of your way.
|
| Some of the scanning that a conservative garbage collector
| has to do looks like access errors to Valgrind. By using the
| Valgrind API, you can grant yourself access to that memory,
| and then lock it again when done.
|
| That aspect doesn't necessarily apply to a pool allocator
| being described in this article; that should not be
| triggering any false positives.
| 8dcc wrote:
| (In case you missed my other post, I ended up adding
| valgrind support in the article and in the 'libpool'
| project.)
|
| > I integrated the TXR Lisp garbage collector with
| valgrind.
|
| That's funny, because when I said "for other projects I am
| working on", I specifically meant a Lisp interpreter which
| uses a memory pool and garbage collection (I am still
| working on implementing this last part, anyway). I am very
| happy about the fact that I can continue using valgrind in
| this project.
| kragen wrote:
| Thanks for the tip!
| tavianator wrote:
| I've replaced Valgrind with sanitizers for most of my workflow.
| Using the sanitizer APIs in your custom allocator is also easy:
| https://github.com/tavianator/bfs/blob/main/src/sanity.h
| https://github.com/tavianator/bfs/blob/31dffd6441ba80ea998ce...
| lsllc wrote:
| The Address Sanitizer (ASAN) in both clang and gcc is
| excellent:
|
| https://github.com/google/sanitizers/wiki/addresssanitizer
|
| I believe to integrate a custom allocator with ASAN, you need
| to look at the memory "poisoning" mechanisms:
|
| https://github.com/google/sanitizers/wiki/AddressSanitizerMa...
| accelbred wrote:
| Note that using this will likely violate strict aliasing due to
| using the void pointer returned as one type, freeing it, and then
| getting that same chunk again and using it as another type.
| You'll probably want to compile with -fno-strict-aliasing to be
| safe.
|
| A good reference on strict aliasing:
| https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...
| leni536 wrote:
| That's what allocators do. If C's object model doesn't allow
| users of the language to write their own allocators then that
| object model is broken.
|
| C++ has relatively recent fixes to allow allocators to work;
| they require calls to std::launder.
| dwattttt wrote:
| I understand the C standard hides such actions behind a wall
| of "this is the allocator", and expected the compiler authors
| to also be the allocator authors, allowing them to know
| when/how they can break such rules (in the context of their
| own compiler)
| unwind wrote:
| No, allocators are not magic (in that regard) in C. There
| is nothing out of the ordinary going on with an allocator,
| the parent comment is simply mistaken (as pointed out by
| another answer).
| dwattttt wrote:
| Ah, I see that it's because the char type never violates
| Strict Aliasing. I was wondering how you could define a
| type such as Chunk, yet hand out a pointer to it to a
| user who will cast it to some other type.
| SleepyMyroslav wrote:
| Well, the pool code fails to use a char* type to write
| inside the block of memory.
|
| If you look at the 'pool_free' code you can see that it
| receives the used block from the user as 'ptr', then casts it
| to a 'Chunk pointer' and writes into the Chunk member a value
| of type 'Chunk pointer'. I had to change that line to memcpy
| when I did it last time, around 15 years ago, in C++ with
| GCC 4.something. In short, if you are writing your own
| allocators you need either to disable strict aliasing
| like Linux does, or do the 'memcpy' dance when you
| access memory as 'internal allocator structures' right
| after it was used as a user type. When it happened to me the
| write was reordered, and I observed code that writes into
| Chunk* next executed first and the write of zeros into the
| user type second.
| uecker wrote:
| Note that C++ has different rules than C. C has type
| changing stores that do not require memcpy for allocated
| storage (i.e. no declared type). Older compilers have
| plenty of bugs related to TBAA though. Newer versions of
| GCC should be ok.
| pjmlp wrote:
| Which is why pointer provenance is an issue as well.
| alextingle wrote:
| There's no reason an allocator should ever trip over strict
| aliasing rules.
|
| malloc() returns an address. The user code writes a value, and
| then reads a value of the same type. No problem.
|
| Later, malloc() returns the same address. The user code writes
| a value of another type then reads a value of that same type.
| Again, no problem. There's no aliasing going on here.
|
| The act of writing a value of a different type tells the
| compiler that the lifetime of the previous object has ended.
| There's no special magic required.
|
| Strict aliasing is about situations where we write a value of
| one type, and then attempt to read a value of a different type
| from the same location. If you want to do that, then you have
| to be extremely cautious. But memory allocators don't do that,
| so it's not an issue.
|
| (Obviously just talking about C here, or POD in C++ terms. If
| you are dealing with C++ objects with destructors, then you
| have an extra layer of complexity to deal with, but again
| that's nothing to do with aliasing.)
| clairebyte wrote:
| >The act of writing a value of a different type tells the
| compiler that the lifetime of the previous object has ended.
|
| afaik only memcpy has that magic property, so I think parent
| is almost correct.
|
|     void *p = malloc(n);
|     *(int *)p = 42;        // ok, *p is now an int
|     //*(float *)p = 3.14f; // I think this is not allowed:
|                            // p points to an int object, and regular
|                            // stores do not change effective type
|     float x = 3.14f;
|     memcpy(p, &x, sizeof(float)); // but this is fine, *p now has
|                                   // effective type float
|
| So in pool_new:
|
|     pool->chunk_arr[i].next = &pool->chunk_arr[i + 1];
|
| This sets the effective type of the chunk block to 'Chunk'.
|
| Later in pool_alloc:
|
|     Chunk* result = pool->free_chunk;
|     ...
|     return result;
|
| result has effective type 'Chunk'.
|
| In user code:
|
|     int *x = pool_alloc();
|     *x = 42; // aliasing violation: *x has effective type 'Chunk'
|              // but we tried to access it as an int
|
| User code would need to look like this:
|
|     int *x = pool_alloc();
|     memcpy(x, &(int){0}, sizeof(int)); // establish new effective
|                                        // type as 'int'
|     // now we can do
|     *x = 42;
| bigpingo wrote:
| And this is why type based alias analysis (TBAA) is insane
| and why projects like Linux compile with -fno-strict-
| aliasing.
|
| C should issue a defect report and get rid of that nonsense
| from the standard.
| astrange wrote:
| C doesn't have "alias analysis" in the standard. It has
| an (informally specified) memory model which has "memory
| objects" which have a single type, which means treating
| them as a different type is undefined behavior.
|
| This enables security analysis like valgrind/ASan and
| secure hardware like MTE/CHERI so it's very important and
| you can't get rid of it.
|
| However, it's not possible to implement malloc() in C
| because malloc() is defined as returning new "memory
| objects" and there is no C operation which creates
| "memory objects" except malloc() itself. So it only works
| as long as you can't see into the implementation, or if
| the compiler gives you special forgiveness somehow.
|
| C++ has such an operation called "placement new", so you
| want something like that.
| kevin_thibedeau wrote:
| You can definitely implement malloc in C. It does nothing
| special in its most basic form but cough up void pointers
| into its own arena.
|
| It gets complicated when you have virtual memory and an
| OS involved but even then you can override the system
| malloc with a simple implementation that allocates from a
| large static array.
| oguz-ismail wrote:
| > aliasing violation, *x has effective type 'Chunk'
|
| This doesn't make any sense. How do you know its effective
| type if you don't have access to the definition of
| `pool_alloc()'?
| clairebyte wrote:
| If you can guarantee its always compiled in a separate TU
| and never inlined, sure, might be a practical way to
| 'solve' this issue, but if you then do some LTO (or do a
| unity build or something) the compiler might suddenly
| break your code. Another way is to add an inline asm
| block with "memory" clobber to escape the pointer so the
| optimizer can't destroy your code.
|
| It's really quite ridiculous that compiler implementers
| have managed to overzealously nitpick the C standard so
| that you can't implement a memory allocator in C.
| astrange wrote:
| > It's really quite ridiculous that compiler implementers
| have managed to overzealously nitpick the C standard so
| that you can't implement a memory allocator in C.
|
| This is good because it's also what gives you valgrind
| and CHERI. Take one away and you can't have the other.
|
| (Because if you define all undefined behavior, then
| programs will rely on it and you can't assume it's an
| error anymore.)
| uecker wrote:
| Any type-changing store to allocated storage has this
| property in C.
| astrange wrote:
| There are other issues besides changing the memory type.
| For instance, C has those rules about out of bounds
| pointers being undefined, but you can't implement that - if
| you return part of the pool and someone calculates an out
| of bounds address they're getting a valid address to the
| rest of the pool. That's why you can't implement malloc()
| in C.
|
| (The difference here is that system malloc() works with
| valgrind, -fbounds-safety, theoretical secure hardware with
| bounds checking etc., and this one doesn't.)
| gpderetta wrote:
| > The act of writing a value of a different type tells the
| compiler that the lifetime of the previous object has ended.
| There's no special magic required.
|
| I think strictly speaking in C this is only true for
| anonymous memory, i.e. memory you got from some allocator-
| like function. So if your custom allocator gets its memory
| from malloc (or sbrk, or mmap), everything is fine, but, for
| example, you are allocating from a static char array, it is
| formally UB.
|
| In C++ you can use placement new to change the type of
| anything.
| uecker wrote:
| Yes, type-changing stores are guaranteed by the C standard
| to work only for allocated storage. In practice no compiler
| makes a difference between declared and allocated storage.
| A bigger problem is that Clang does not implement the C
| rules as specified..
| 8fingerlouie wrote:
| A long time ago, in a galaxy far far away, i wrote something
| similar for an embedded platform. I was implementing a WAP
| Browser at the time (yes, WAP,
| https://en.wikipedia.org/wiki/Wireless_Application_Protocol), and
| we needed dynamic memory allocation in an environment where
| everything was static due to realtime constraints.
|
| So i ended up with this : https://github.com/jinie/memmgr
| guardian5x wrote:
| The Latin Modern font on this website does not look so good in
| Chrome. Is it because of the font size, or is it just me?
| johnisgood wrote:
| I was going to say this. Looks awful to me as well. It is the
| font-family.
| dmonitor wrote:
| Sounds like a Windows issue based on other people in the
| thread. I'm on Firefox with the same issue. Removing Latin-
| modern from the CSS file made it much more readable for me.
| 9029 wrote:
| Another (admittedly less portable) way to solve the resizing
| problem is to reserve a large virtual memory space using linux
| mmap with the PROT_NONE flag and then commit pages as needed
| using mprotect with PROT_READ|PROT_WRITE flags. I believe
| windows' VirtualAlloc/VirtualProtect also allows this
| jeffbee wrote:
| You don't need to do anything with the flags. You can just mmap
| a huge virtual area and the pages will be realized whenever you
| first touch them.
| 9029 wrote:
| Today I learned, thanks
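The reserve/commit scheme described above can be sketched like this (Linux-specific; on Windows the equivalent would be VirtualAlloc with MEM_RESERVE and then MEM_COMMIT). As noted in the reply, on Linux you can also just map the whole range PROT_READ|PROT_WRITE up front and let pages be realized on first touch:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

enum { RESERVE_SZ = 1 << 20, PAGE_SZ = 4096 };

/* Reserve a big range with no access rights: this consumes address
   space but no memory, and the pool never has to move, so pointers
   into it stay valid as it grows. */
static void *pool_reserve(size_t size) {
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Commit a prefix of the reservation as the pool fills up. */
static int pool_commit(void *addr, size_t size) {
    return mprotect(addr, size, PROT_READ | PROT_WRITE);
}
```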
| cb321 wrote:
| One aspect people don't tend to mention (and TFA does not because
| it has "general purpose" as a scope) about doing your own
| allocation arenas is that because a "pointer" is just the name
| (really number) of an allocated region, this also affords
| flexibility in the width of those numbers aka "pointer size" that
| decides your "address" space.
|
| So, for example, if you are willing to limit yourself to 64
| KiNodes of equal sized objects, a pool like
| https://github.com/c-blake/bst/blob/main/POOL.c can use an
| unsigned short 2-byte pointer. This small address space
| specialization can give a full 4x space overhead optimization
| over modern 64-bit pointers.
|
| You could even use 8x less with 1-byte pointers if 256 nodes is
| enough or use bit fields or equivalents for even more fine-
| grained address space size selection. Admittedly, linked
| structures are usually less interesting for very small address
| space sizes, but this is sometimes useful. People may complain
| about arbitrary limits that relate to this kind of optimization,
| but the linked part _could_ be "small by design" without
| compromising overall scale { such as for structure _within_ a
| B-tree node or other multi-level indexing setups }.
|
| In any event, I think it is useful to highlight pointer size
| arbitrariness to junior engineers. It is often learned once &
| immediately forgotten behind an onslaught of general purpose-ness
| (& pointer type systems/syntax). Relocation ideas also relate.
| E.g., if all the pointers are implicitly indices into a (small?)
| array, and all the code needs a "base address" of that array to
| get a VMem address, then you can relocate the whole bundle of
| objects pretty easily changing only the base address.
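The small-address-space idea above can be sketched with 2-byte index "pointers" (illustrative names, not from the linked POOL.c):

```c
#include <assert.h>
#include <stdint.h>

/* With at most 64 Ki nodes, a "pointer" can be a 2-byte index into the
   node array: 4x smaller than a 64-bit pointer, and the whole structure
   is relocatable by changing only the array's base address. */
enum { NIL = 0xFFFF, NNODES = 8 };

typedef struct {
    uint16_t next;   /* index of the next free node, not an address */
} Node;

typedef struct {
    Node     nodes[NNODES];
    uint16_t free_head;
} IdxPool;

static void idxpool_init(IdxPool *p) {
    for (uint16_t i = 0; i < NNODES - 1; i++)
        p->nodes[i].next = i + 1;
    p->nodes[NNODES - 1].next = NIL;
    p->free_head = 0;
}

static uint16_t idxpool_alloc(IdxPool *p) {   /* returns NIL when full */
    uint16_t i = p->free_head;
    if (i != NIL)
        p->free_head = p->nodes[i].next;
    return i;
}

static void idxpool_free(IdxPool *p, uint16_t i) {
    p->nodes[i].next = p->free_head;
    p->free_head = i;
}
```

Converting an index back to a VMem address is just `&p->nodes[i]`, which is the "base address" relocation point made above.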
| 8dcc wrote:
| Yes, this is true. I didn't use that in the code because I
| thought it would make it less readable, though. I might mention
| it somewhere, thank you.
| cb321 wrote:
| In many prog.lang's, it absolutely makes things less
| readable. This is not an accident, but because the PL itself
| specifically tries to syntactically support pointer types
| (i.e. to literally aid readability). Most PLs do so only for
| general purpose VMem pointers with user-defined pointers
| being afterthoughts (at best). You probably need (at least!)
| operator overloading to get the same level of readability as
| "PL-native" pointers { though "readability" is a bit
| subjective and powerful PL features can be divisive; Scare
| "quotes" for interpretation hazard :-) }.
|
| Anyway, you probably knew all that & I don't mean to suggest
| otherwise, but it made sense to write since HN loves PL talk.
| kps wrote:
| 'Handles are the better pointers'1 is a related writeup that
| has been discussed here before2.
|
| 1 https://floooh.github.io/2018/06/17/handles-vs-pointers.html
|
| 2 https://news.ycombinator.com/item?id=36419739
| matheusmoreira wrote:
| This is really informative. I wonder if there is a better
| solution for the reallocation problem, one that preserves
| pointers.
| jll29 wrote:
| This code is very beautifully written, thanks for sharing.
|
| However, you should consult the book "C: Interfaces and
| Implementations" by Dave Hanson, which has a library called
| "Arena" that is very similar to your "Pool"s, and it shows a few
| more tricks to make the code safer and easier to read (Chapters
| 5+6, 6 in particular).
|
| D. Hanson's book (written in the 'literary programming' style
| invented by Donal Knuth) also demonstrates having debugging and
| production implementations for the same C API to catch memory
| errors, and his book is from 1997: > Copyright
| (c) 1997 by David R. Hanson
|
| (He used to be a SE professor at Princeton before getting hired
| by Microsoft, I believe.)
| 8dcc wrote:
| Thank you so much! I will definitely check it out. I also
| planned on writing an article about arena allocators soon.
| kragen wrote:
| I think you mean "literate programming".
|
| The arena allocator in Hanson's chapter 6 is an arena
| allocator, not a "pool allocator" in the sense that 8dcc means,
| which is to say, a constant-time fixed-size-block allocator. It
| seems that you had the same reading comprehension problem that
| I did.
|
| The difference is that Hanson's arena allocator can make
| allocations of any size (his Arena_alloc takes an nbytes
| argument) and cannot deallocate individual allocations, only
| the entire arena (his Arena_free takes only an arena pointer as
| a parameter). A fixed-size-block allocator like the one we are
| discussing here can only make allocations of a fixed size, but
| can recycle them individually, not only all at once.
|
| (If Hanson's name sounds familiar, it might be because you've
| read the lcc book.)
|
| It may be worth noting that Donal _d_ Knuth also invented the
| font used in 8dcc's article.
| actionfromafar wrote:
| Honestly, "literary programming" does better explain how
| Knuth writes code.
| kragen wrote:
| Yes, but that name lacks the benefit of implicitly
| describing other approaches to programming as "illiterate".
| actionfromafar wrote:
| Haha, never thought about that!
| kragen wrote:
| Knuth did.
| kevin_thibedeau wrote:
| Reading the Metafont source is pretty challenging because
| of all the transclusions. There's a reason why programs
| aren't written like journal papers.
| kragen wrote:
| I think we can probably do better than METAFONT with the
| benefit of hindsight. Unit tests would help with
| comprehensibility. Including example outputs, too, like
| Darius Bacon's "halp", or R-Markdown, Jupyter, doctest,
| and org-mode can do. Higher-level languages than Pascal
| would probably be an improvement. Maybe building up the
| program as a series of versions, like Kartik Agaram's
| idea of "layers"; I did something similar but less
| complete in my literate-programming system "handaxeweb".
| We can do decent typography on a computer monitor now, so
| in the worst case, we can attempt to leverage
| interactivity, as so-called "explorable explanations" do.
|
| Both programs and journal papers are mostly written the
| way they are because of tradition, not because it's the
| best way we've found to write them. Two examples:
|
| Nobody would intentionally design a programming language
| to make it impossible to include the kind of diagrams
| that illustrate this article, but most programming
| languages are in fact designed in such a way.
|
| Buckminster-Fuller-style "ventilated prose" turns out to
| be more readable than traditional paragraph formatting in
| psychology experiments, which is why we use it
| universally in our programs. But you can't publish a
| paper formatted that way unless you call it a poem.
| johnisgood wrote:
| My minor nitpick regarding the code is the choice of variable
| names, but other than that, it is indeed easy to read and it is
| how I implement data structures and whatnot.
|
| I like that there is a more or less exhaustive blog post about
| it which happens to look nice as well.
| cb321 wrote:
| Kernighan & Ritchie's "The C Programming Language" also has an
| example implementation of malloc() & free() { for any-size
| requests as @kragen notes in sibling } in it in like 3 pages
| (though, sure, it is far from the most efficient
| implementation). Since it's the last section of the main text,
| I'd guess that most readers don't get that far (even though
| it's a short book).
| IgorPartola wrote:
| Even if you don't program in C, I think K&R should be
| required reading for anyone who writes software for a living.
| You can finish it in an afternoon, or two days if you
| actually write and run the code outlined in the book.
|
| My bucket list includes an item to write a novel efficient
| implementation of malloc/free. I have taken several stabs at
| this in the past but have never been happy with it. I have no
| real need of a custom allocator but it is something that I
| think would be very cool to have under my belt. This was
| inspired by K&R.
| exDM69 wrote:
| Here's another interesting O(1) memory allocator but with
| arbitrary sized allocations and low fragmentation. Negative side
| is relatively high memory overhead (a few dozen bytes per
| allocation).
|
| This kind of allocator is often used to suballocate GPU memory in
| game and graphics applications.
|
| I'm using a variant of this algorithm with added support for
| shrinkable and aligned allocations and flexible bin sizing.
|
| You can also extend this idea to two dimensions to create texture
| atlases, which is possible in O(1) for power of two sized
| allocations.
|
| Original: https://github.com/sebbbi/OffsetAllocator Rust port:
| https://crates.io/crates/offset-allocator
| kragen wrote:
| This is a very nice article! The diagrams in particular are very
| clear. (They seem to have been done in draw.io:
| https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al...
| which seems to be a pretty decent Apache-2.0 free software
| licensed diagram editor: https://github.com/jgraph/drawio/)
|
| I think it would be improved further if it began with the calling
| interface callers use to invoke the allocator, because it's
| easier to understand the implementation of some service such as
| pool allocation when you begin by knowing what service, exactly,
| is required. I began with the assumption that the author was
| describing an arena allocator under another name, rather than a
| fixed-size block allocator, both because the APR's influential
| arena allocator is called "apr_pool_t" and because one of the
| main headings in the table of contents (a greatly appreciated
| feature, like all the typography of the article) was "Closing the
| pool". It took me quite a while to figure out that I was
| mistaken. It's plausible that I only had this problem because I
| am an unusually stupid, ignorant, or pigheaded person (some would
| say all three), but, if not, many other people will have the same
| difficulty I had of taking several minutes to figure out what the
| topic of the article is. (I could have saved myself by clicking
| either of the first two links in the Introduction, but I didn't
| because I thought I already knew what it was.)
|
| The current top-voted comment on this thread is by someone who
| had the same reading comprehension problem I did, but never
| realized their error:
| https://news.ycombinator.com/item?id=42643951 and nobody had
| pointed out their mistake until I did just now. So I think we
| can
| say with some confidence that many HN readers will have the same
| difficulty.
|
| I think that the article would also be improved by a few other
| small additions:
|
| - It says, "The pool allocator, however, is much faster than
| malloc", but doesn't provide any benchmarks or even an order of
| magnitude.
|
| - Probably when one benchmarks the implementation given, one will
| find that it is only slightly faster than jemalloc or even
| gnumalloc, because they both dispatch small allocations to a
| fixed-block allocator similar to this one, and the dispatch isn't
| very expensive. This can be fixed by inlining the allocation fast
| path and the deallocation function, which will then compile down
| to a few instructions each. A simple demonstration of how to do
| this is in my arena allocator kmregion in
| http://canonical.org/~kragen/sw/dev3/kmregion.h and
| http://canonical.org/~kragen/sw/dev3/kmregion.c (GPLv3+). The
| current implementation of 8dcc libpool is guaranteed to be unable
| to do this unless you're using link-time optimization or a so-
| called "unity" or "amalgamation" build process.
|
| - It's worth distinguishing between average-case performance (in
| which this allocator is _slightly_ better than malloc) and worst-
| case performance (in which case malloc, generally speaking,
| isn't even in the running).
|
| - It may be worth mentioning that often such pool allocators are
| used in conjunction with initializing the objects in the pool to
| a valid state when the pool is created; that way you don't have
| to reinitialize objects to a valid state every time you
| deallocate and reallocate them, which is usually more work than
| malloc does to find the right free list. This does require not
| overwriting the user data with your freelist pointer, so it's a
| space/time tradeoff.
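|
| A sketch of that tradeoff (all names invented here): the free
| list is kept in a separate index stack, so freed objects keep
| their initialized fields instead of having them overwritten by a
| freelist pointer:

```c
#include <stdlib.h>

typedef struct Object {
    int refcount;        /* example fields that stay valid while free */
    void *vtable;
} Object;

typedef struct InitPool {
    Object *objs;
    size_t *free_stack;  /* indices of free chunks, kept out-of-band */
    size_t free_top;
} InitPool;

/* Initialize every object exactly once, at pool creation. */
InitPool *init_pool_new(size_t n) {
    InitPool *p = malloc(sizeof *p);
    if (!p) return NULL;
    p->objs = malloc(n * sizeof *p->objs);
    p->free_stack = malloc(n * sizeof *p->free_stack);
    if (!p->objs || !p->free_stack) {
        free(p->objs); free(p->free_stack); free(p);
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        p->objs[i].refcount = 0;
        p->objs[i].vtable = NULL;
        p->free_stack[i] = i;
    }
    p->free_top = n;
    return p;
}

Object *init_pool_alloc(InitPool *p) {
    if (p->free_top == 0) return NULL;
    return &p->objs[p->free_stack[--p->free_top]]; /* already valid */
}

void init_pool_free(InitPool *p, Object *o) {
    p->free_stack[p->free_top++] = (size_t)(o - p->objs);
}
```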
|
| - Maybe you should mention alignment, especially now that even on
| amd64 GCC has taken to using new instructions that require
| alignment.
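|
| One common way to handle that (a sketch, not from the article):
| make the chunk a union with C11's max_align_t, so every element
| of a chunk array is aligned for any object type:

```c
#include <stddef.h>   /* max_align_t */

#define CHUNK_SIZE 40

/* Forces each chunk (and thus each element of a chunk array) to
   worst-case alignment, typically 16 bytes on x86-64, which keeps
   compiler-generated aligned SSE/AVX accesses legal. */
typedef union AlignedChunk {
    max_align_t align;
    union AlignedChunk *next;
    char data[CHUNK_SIZE];
} AlignedChunk;
```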
|
| Even as it is, though, it's highly educational, visually
| pleasant, and very understandable (once I got over my initial
| misconception of having a totally wrong idea of what the article
| was about).
| 8dcc wrote:
| > The diagrams in particular are very clear. (They seem to have
| been done in draw.io:
| https://github.com/8dcc/8dcc.github.io/blob/main/img/pool-al...
| which seems to be a pretty decent Apache-2.0 free software
| licensed diagram editor: https://github.com/jgraph/drawio/)
|
| Yes, this is true. Furthermore, the diagram information for
| draw.io is included in the SVG images themselves, so one could
| edit them there.
|
| > It says, "The pool allocator, however, is much faster than
| malloc", but doesn't provide any benchmarks or even an order of
| magnitude.
|
| You are right, I was referring to "malloc-like" allocators
| (i.e. for variable-size blocks). It would be a good idea to
| benchmark them.
| kragen wrote:
| In general when you're working on an optimization (and this
| is an optimization) benchmarking it is essential. About a
| third of the optimizations I try end up making my code slower
| instead of faster, often because of performance
| considerations I hadn't thought of. You can't just say "it's
| much faster" without measuring.
| 8dcc wrote:
| Yes, you are right. Do you have any suggestions on how I
| should do the benchmarking? I am not sure how this is
| usually done, but if the output is some kind of graph, I
| would also like to know how these graphs are usually
| produced.
| kragen wrote:
| I'm no expert on this, so probably someone else can give
| you better advice, but I'm happy to share what I do know.
| Here are the most important points.
|
| Benchmarking is a full-fledged scientific experiment. You
| have a hypothesis that a particular change that you can
| make, such as using a different allocator, will have a
| particular effect, such as making your program run in
| less time. So you make _just that one single_ change and
| measure the results, while carefully controlling all
| other conditions that could _also_ affect your program's
| run time. You have to do the test more than once in order
| to find out whether you've succeeded in controlling them.
| If you do a good enough job controlling the conditions of
| the experiment, it becomes _reproducible_ and therefore
| falsifiable. Programming usually isn't science, but
| benchmarking is. A great thing about it is that each run
| of the experiment can take a fraction of a second, so you
| can do the whole scientific method from beginning to end
| in a few minutes and get a reproducible, falsifiable
| result.
|
| Benchmarking can get arbitrarily sophisticated, but the
| most important thing is to dodge the bullets that can
| turn benchmarks into nonsense, rather than produce very
| sophisticated graphs or get very high precision.
|
| The simplest thing on Linux or macOS is to compile a
| command-line executable you can run in a terminal and run
|     time ./testprogram
|
| at least three times to get an idea of how long the
| wallclock time is and how variable it is. (The user and
| system times shown are not very reliable anymore.)
| Presumably there is also some way to do the same thing on
| Microsoft Windows. If you're on a laptop, check to see if
| it makes a difference if you're plugged in. Also, check
| whether the result from
|     time ./testprogram; time ./testprogram
|
| is consistent; sometimes lags in CPU speed scaling can
| produce unpredictable results.
|
| Linux is not very consistent, so you should expect
| variations in the range of 1-10% from run to run when
| your program takes on the order of a second to run.
|
| If you get consistently shorter wallclock times after
| recompiling the program with a different allocator and
| the same compiler options (probably make sure
| optimization isn't completely turned off), you can
| probably attribute that difference to the different
| allocator. It's a good idea to switch back to the
| original allocator and verify that you can reproduce your
| original results to make sure you didn't accidentally
| change something else at the same time (maybe your laptop
| went into low-power mode because the battery is getting
| low, or maybe you changed the number of iterations in a
| loop and forgot you changed it, or maybe Firefox loaded a
| YouTube video in the background and is making the machine
| swap, that kind of thing).
|
| It's useful to ensure that you aren't swapping and that
| your CPU load other than the benchmark is minimal. `top`
| (or, better, `htop`) and `vmstat 5` (or, better, `dstat
| 5`) can be helpful here. Unless your program is heavily
| multithreaded, the CPU load is not as crucial as it used
| to be now that we all have multicore processors.
|
| It's helpful to check the particular version of the
| program you're benchmarking into Git, including the
| Makefile or whatever specifies the compiler options. That
| way when you write down your results you can refer to the
| specific Git commit hash. Also, record your compiler
| version, your operating system version, and your CPU.
| Ideally someone who isn't you should be able to read your
| notes on the benchmark, run it again, and get the same
| results, if they have access to the same hardware.
|
| The actual benchmark program itself can be as simple as
| allocating and deallocating in a loop, ideally inside
| another loop to run the whole benchmark multiple times;
| but on processors with cache (which includes every PC
| since the early 90s) the working set size can be very
| important, and you may get significantly different
| results for allocating 20 kilobytes 64 bytes at a time
| and 20 gigabytes 64 bytes at a time. If you pass in the
| number of iterations for one of the loops on the command
| line, instead of hardcoding it in the program code, it's
| easy to start with a small iteration count and work your
| way up a factor of 10 at a time until the program takes a
| few seconds to run. In this case, also be sure to record
| the specific command line you were measuring the
| performance of in your notes; knowing that a program took
| 2.205-2.216 seconds to run is of no value if you don't
| know if it was running a thousand iterations or a
| million.
|
| It's a good idea to actually compute something that you
| output using the memory you're allocating and
| deallocating, just in case GCC's optimizer decides that
| your benchmark loop has no effect and therefore can be
| safely removed from your program. (In
| http://canonical.org/~kragen/sw/dev3/kmregion_example.c I
| computed the sum of the data value stored in each node of
| the 24th of all the 5000-node linked lists I allocated
| and deallocated. That's because, when I read the
| disassembly of my test code with `objdump -d
| kmregion_example`, I found out that GCC had removed most
| of the code inside the test loop because it could prove
| that nobody ever read the struct fields I was
| initializing. Reading disassembly is neither necessary
| nor sufficient for dodging all bullets that invalidate
| benchmarks.)
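|
| Put together, the benchmark kernel can be as small as this
| sketch (allocator stubbed with malloc/free; swap in the
| allocator under test). The caller prints the returned sum and
| takes the iteration counts from argv, e.g.
| printf("%ld\n", run(atol(argv[1]), atol(argv[2]))):

```c
#include <stdlib.h>

static void *bench_alloc(size_t n) { return malloc(n); } /* stand-in */
static void  bench_free(void *p)   { free(p); }

/* Writes and reads every block and returns a sum the caller must
   print, so the optimizer cannot prove the loops dead and delete
   them. */
long run(long outer, long inner) {
    long sum = 0;
    for (long i = 0; i < outer; i++)
        for (long j = 0; j < inner; j++) {
            long *p = bench_alloc(sizeof *p);
            *p = j;       /* touch the memory... */
            sum += *p;    /* ...and consume it */
            bench_free(p);
        }
    return sum;
}
```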
|
| Varying the number of iterations in both the innermost
| and outermost loop should produce roughly proportional
| increases and decreases in wallclock time. (A helpful
| technique is to subtract the time for 1000 iterations
| from the time for 1001000 iterations to get the time for
| the last million iterations.) If it doesn't, you're not
| measuring what you think you're measuring. For example,
| especially with JIT compilers, you may be measuring
| startup overhead. That bit me this week with
| http://canonical.org/~kragen/sw/dev3/hadamardsin.js,
| which I erroneously reported was slower to run than the
| LuaJIT version. It turns out that it's slightly faster to
| _run_, just much slower to _compile_.
|
| Running the benchmark on more than one computer is
| helpful because sometimes changes that speed things up on
| one system slow them down on another. I remember a
| recursive backtracking search program for logic circuit
| optimization I wrote in C long ago which used longjmp()
| to terminate the search when it found a solution; it ran
| reasonably fast on Linux but very slowly on my friend's
| Macintosh because MacOS implemented setjmp() as
| sigsetjmp(), transforming it from a very cheap operation
| into a very expensive one.
|
| That's all you need to know to reliably do optimizations.
| In fact, that's probably more than you need to know.
| Everything below this point is extra.
|
| -- *** --
|
| Allocators are especially tricky to benchmark because
| often, by changing where data is stored in memory, they
| profoundly affect the speed of the rest of your program,
| in ways that vary from program to program. More memory
| fragmentation, for example, can increase how many TLB
| misses your program runs into. I don't know what to do
| about that except try to benchmark some real programs as
| well as simple nested loops.
|
| In cases where you're trying to measure an effect that's
| roughly the size of your measurement precision,
| statistical tests can be useful.
| https://github.com/c-blake/bu/blob/main/doc/edplot.md
| linked by cb321 above discusses using the 2-sample
| Kolmogorov-Smirnov test, which is not nearly as
| terrifying as it sounds. But most effects will be either
| much larger than your measurement error, in which case
| the result is obvious enough that you don't need the
| statistics, or much smaller, in which case all they'll
| tell you is that you can't be sure about the direction of
| the effect, much less its magnitude. In medicine, where
| your experiments take years and people die in them, using
| sophisticated statistics to make the most of your
| precious data is very important. In performance
| benchmarking, where your experiments take milliseconds,
| it's usually better to improve your measurement precision
| instead. Sometimes just running the same benchmark more
| times is enough, but there are more powerful techniques
| available.
|
| A way to get higher precision than `time` can give you is
| to run the program under `valgrind` (on Linux). Even raw
| valgrind will tell you exactly how many instructions it
| ran, and in most cases that number is the same from run
| to run, so instead of a measurement with +-3% precision
| like `time` usually gives you when you've controlled
| everything properly, you get a measurement typically with
| +-0.00000001% precision. This is fantastic, far beyond
| most things you can do in a physics, electronics, or
| chemistry lab. It means that even with just two runs you
| can tell if your optimization affected that number by
| even 0.0000001%, or, more interestingly, 0.1%. And it
| isn't affected by things like which CPU you run it on or
| whether your laptop is unplugged, and it only slows the
| program down about 10x.
|
| Unfortunately, it measures the wrong thing, because a
| program that runs more instructions can still be faster.
| `valgrind --tool=cachegrind` tries to simulate your CPU's
| cache hierarchy to give you a better idea of how long the
| program will actually take; the `kcachegrind` GUI can
| give you a "cycle estimation" number based on the numbers
| it spits out. But it can still be unrelated to how fast
| the program actually runs for a variety of reasons;
| system calls are the biggest one. `kcachegrind` is also a
| "profiler": it shows you how much time (simulated
| instructions, simulated cache misses, cycle estimation,
| whatever) is used by each subroutine in your program.
|
| You might wonder why an optimization that speeds your
| program up by 0.1% matters. The answer is that if you do
| 106 such optimizations you have sped your program up by
| 10%. SQLite has improved its performance by more than a
| factor of 3 over a 10-year period by using this approach:
| https://www.sqlite.org/cpu.html
|
| There's a more basic profiler called gprof built into
| GCC; building your program with `-pg` will add profiling
| instrumentation in to it, so that it writes profiling
| information into a file called `gmon.out`. After running
| a program built with gprof, you can run `gprof
| testprogram` to analyze `gmon.out`; it will tell you how
| much time each subroutine took (including and excluding
| its transitive callees) and how many times it was called.
|
| Linux also has a built-in profiler called `perf_events`
| https://perfwiki.github.io/main/ which uses hardware
| performance counters to get numbers like cachegrind's,
| but using the real hardware instead of simulation, and
| including a lot more detail; it can also profile the time
| your program spends in system calls. I've only used it in
| very basic ways, so I don't know very much about it.
| Intel also has a proprietary thing called VTune which can
| do things perf can't and runs on OSes that aren't Linux,
| but doesn't support AMD's processors.
|
| -- *** --
|
| I hope this is of some value to you! I'd be happy to
| answer any other questions. Hopefully correctly.
| 8dcc wrote:
| Wow. This is probably the best comment I have ever seen
| on HN. Thank you so much for the information, I will add
| benchmarking to my 'libpool' project and mention it in
| the article. I can't think of any questions right now,
| but I will let you know if anything comes up. Thank you
| again. :)
| kragen wrote:
| Aww, thanks! *blush*
| 8dcc wrote:
| Actually, I have one question. You said:
|
| > The actual benchmark program itself can be as simple as
| allocating and deallocating in a loop [...]
|
| What do you mean, exactly? Deallocating immediately after
| allocating? I feel like there would be a big difference
| between that and allocating multiple blocks before
| freeing them.
| cb321 wrote:
| You are right - there will be a big difference for these
| two different workloads. As I am 100% sure @kragen
| already knows, this is arguably the first, smallest step
| in thinking for you to realize just how many possible
| workloads you might want to model. :-) And to realize why
| senior engineers will say pithy things like "the best
| benchmark is your actual application".
|
| While that is true (and I've said it myself), one reason
| I like the design of a little "workload interpreter
| shell" as in
| https://github.com/c-blake/bst/blob/main/test/bst-
| interact.c (or maybe better https://github.com/c-blake/ad
| ix/blob/master/tests/btshell.ni... which has a
| preprocessor) is that you can use them to run "isolated
| tests" as a kind of "half step" between a pure hot loop
| dumb thing and a full-on application with all its
| attendant noise and self-interaction/self-disturbance.
|
| So, for example, in this allocation problem setting you
| could write a similar 30-line "shell/interpreter" that
| interprets a binary stream of "commands" or allocation
| requests. The origin of said binary stream could be a
| recording/log from some actual app you have doing
| something actually useful. Then you could apply that one
| recording in the "shell" in different configurations as
| @kragen mentions to study the allocator performance (both
| space & speed) in isolation. At that point you could
| start to say pretty meaningful things from replaying logs
| many times about this or that tradeoff on this|that
| workload.
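|
| A sketch of such a replayer (trace format invented here): encode
| the recording as an array of int32 ops, where op > 0 means
| "allocate into slot op" and op < 0 means "free slot -op", then
| drive the allocator under test from it:

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_SLOTS 1024

static void *slots[MAX_SLOTS]; /* live allocations, indexed by slot id */

/* Replay a recorded trace against the allocator under test
   (stubbed with malloc/free here). A real trace would be recorded
   from an actual application and mmap'd in from a file. */
void replay(const int32_t *ops, size_t n, size_t block_size) {
    for (size_t i = 0; i < n; i++) {
        int32_t op = ops[i];
        if (op > 0) {
            slots[op] = malloc(block_size);
        } else if (op < 0) {
            free(slots[-op]);
            slots[-op] = NULL;
        }
    }
}
```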
|
| EDIT: Of course, at _that_ point you start to get into
| debates about "how representative" different workloads
| are, but at least you could potentially share your log
| with someone who had a different computer and have them
| run benchmarks, too, and at least the workload is not
| synthetic but relates to some actual possible program.
| Moving from synthetic benchmarks to log-based ones was a
| big thing in OS research on filesystems in the 1970s &
| 80s and it helped the "generalizability of results" (I
| think, anyway..I'm sure many still do only synthetic
| benchmarking due to its convenience).
|
| EDIT2: I just thought of a cute name for the log replay
| half-step between synthetic microbenchmarks and a full
| application - "The millibenchmark". ;-) Maybe someone
| else has thought of this before, though.
| kragen wrote:
| The "interpreter shell" is an interesting idea!
|
| Recording traces of allocation requests from actual
| applications for allocator-benchmarking purposes
| revolutionized allocator performance research in the late
| 90s, if I recall correctly, in addition to the earlier
| filesystem research you mentioned. Benchmarks based on
| those traces were the bedrock of Zorn's case that
| generational GCs were faster _in practice_ than malloc().
|
| But I think (though possibly this is motivated reasoning)
| that benchmarking _some_ workload is good enough for many
| optimizations. A fixed-size-block allocator is already
| pretty restricted in what workloads you can run on it,
| after all, and its performance characteristics are a lot
| simpler than something like first-fit. Showing that an
| optimization makes _some_ workload faster is infinitely
| better than not knowing if it makes _any_ workload
| faster. Showing that it generalizes across many workloads
| is better, of course, but it's not nearly the same step
| in utility.
| cb321 wrote:
| Well, you can use it for everything. I've used it for in
| memory hash tables, trees, etc. To be clear, for those
| who may not understand, the "interpretation" in these
| "millibenchmarks" for something as simple as an allocator
| is not like Python or Bash or something slow. It could
| literally be "mmap a binary file of integers and either
| alloc/free the binary numbers". So, yeah, there might be
| some CPU pipeline overhead in unpredictable branches and
| you might still want to try to measure that, but beyond
| that there is basically no work and any real workload
| might have one-ish hard to predict branch leading to
| allocator operations. So, even unadjusted for overhead
| it's not so bad. And it keeps your measurements fast so
| you can get statistically many of them even if there is
| an "inner min loop" to try to isolate the best case or
| something. (Well, at least if what you are measuring is
| fast.) The downside/tradeoff is just lack of fidelity to
| interaction effects in the full application like resource
| competition/branch predictor disruption/etc.
|
| You could probably set up a
| https://github.com/c-blake/nio file for it (or other) if
| you wanted to be systematic. A problem with the "text-
| newline" part of the "unix philosophy" is that it incurs
| a lot of unneeded binary->ascii->binary cycles that's
| actually more programming work as well as more CPU work,
| and it's really fairly orthogonal to the rest of the Unix
| ideas.
|
| EDIT reply to parent EDIT
|
| > benchmarking some workload is good enough
|
| Fair enough, except that "good enough" is sort of "famous
| last words" if anyone is reviewing your work. :-) :-)
| Some will complain you only did Intel not AMD or not ARM
| or not xyz or Apple Silicon or whatever other zillion
| variables. Something is always better than nothing,
| though. It should be thought of more like a Bayesian
| thing - "no surprise for untested situations" or maybe
| "confidence in proportion to evidence/experience".
|
| FWIW, I thought your big advice to 8dcc was pretty
| decent.
| kragen wrote:
| Thanks! I agree with all of that except for the last
| statement, which I am not competent to judge.
| kragen wrote:
| You are sure right about that! The example I linked
| allocates 5000 blocks and makes a linked list out of
| them, then deallocates them.
| cb321 wrote:
| Modern CPUs (with all their buffers & caches & predictor
| warm-ups & spin-ups) can make benchmarking really tricky
| unless you have a very specific workload in mind and even
| then it requires care (like looking at both the edge of[1]
| and full timing distributions[2] for best-case,
| average/median/typical case, worst case analysis, etc.).
| Briefly, the number of ways to get (un)lucky has
| proliferated over the decades, as has the variety of
| deployment platforms.
|
| Given the depth of that rabbit hole, as writing advice, I
| might suggest instead motivating via "more space efficient"
| and refer to exactly fitting fixed-size objects where more
| general purpose allocators will (usually)^1 waste space to
| the next power of two. Measuring less dynamic space is also
| easier (very dynamic space like %cache used can be tricky,
| though). Controlling scope and pedagogy in general is also
| tricky, but I think this rhetorical advice is common
| practice and might also double as a way to address your
| initial misread of the topic. EDIT: he can always do
| refinements / more analysis in follow up posts/something if
| this library has any legs for him.
|
| [1] https://github.com/c-blake/bu/blob/main/doc/tim.md
|
| [2] https://github.com/c-blake/bu/blob/main/doc/edplot.md
|
| ^1 Sure, fancy any-size allocators can tack on a histogram
| of maybe somewhat rounded lg sizes (to economize histo
| space) and spawn fixed-size object sub-arenas if there are
| enough of them, say >= enough to fill a VMem page or 2, but
| this is rare in my experience and also complicates the
| dispatch to subarenas that you mention above to slightly
| fancier search. This kind of topic _rapidly_ spirals into
| allocator-complete discussion and even systems-complete -
| e.g. a filesystem is largely an allocator with good old MS-
| DOS FAT being close to a "file is a pool of disk blocks"
| model.
| kragen wrote:
| This is an excellent comment. Thank you.
| 2rsf wrote:
| This brings back old memories; I implemented something similar on
| a motorola 68360 ages ago. Since the size of the buffer I used
| was not huge I skipped the pointers and simply enumerated each
| chunk, it was remarkably efficient and stable.
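|
| That enumeration trick might look like this sketch (not the
| original 68360 code, of course): the free list is kept as
| uint16_t chunk indices, so no chunk needs room for a full
| pointer:

```c
#include <stdint.h>

#define NCHUNKS 256
#define CHUNK_SIZE 32
#define NIL 0xFFFF

static uint8_t  chunks[NCHUNKS][CHUNK_SIZE];
static uint16_t next_free[NCHUNKS];  /* index-based free list */
static uint16_t free_head;

void ipool_init(void) {
    for (uint16_t i = 0; i < NCHUNKS - 1; i++)
        next_free[i] = i + 1;        /* each chunk links to the next */
    next_free[NCHUNKS - 1] = NIL;
    free_head = 0;
}

void *ipool_alloc(void) {
    if (free_head == NIL) return 0;
    uint16_t i = free_head;
    free_head = next_free[i];
    return chunks[i];
}

void ipool_free(void *p) {
    /* Recover the chunk's index from its address. */
    uint16_t i = (uint16_t)((uint8_t (*)[CHUNK_SIZE])p - chunks);
    next_free[i] = free_head;
    free_head = i;
}
```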
| codr7 wrote:
| I usually default to slab allocators if I have to write my own:
|
| https://github.com/codr7/libcbs#slab-allocation
| drysine wrote:
| Also "Arena allocator tips and tricks (nullprogram.com)" [0]
|
| [0] https://news.ycombinator.com/item?id=37670740
| sylware wrote:
| It all depends on the load profile using the allocator. You never
| know; that said, in theory you cannot beat a semantically
| specialized allocator for a specific load... in other words, the
| right way(TM). This means applications should bundle their own
| allocators for the various load types they have. The "generic"
| allocator is sort of a heresy for the lazy, the short-termists,
| or those in a hurry. Don't worry, I still use such a generic
| allocator sometimes, but often I mmap the memory myself and deal
| directly with it.
| astrange wrote:
| The system allocator is the one that comes with all the
| system's memory debugging tools and security (or lack of it),
| so you'd better not mind losing those if you want to ditch it.
| voctor wrote:
| There is a small series of posts on the BitSquid blog about
| memory allocation which is worth reading!
| http://bitsquid.blogspot.com/2015/08/allocation-adventures-3...
| 8dcc wrote:
| Hello, I am the author. Thank you all for the instructive
| comments, I made some changes to the article since I first posted
| it:
|
| - Added a link to this HN post.
|
| - Renamed some variables and types in the code for readability.
|
| - Mention (in a footnote) that we could allocate the 'Pool'
| structure and the chunk arrays with a single call, or even return
| the 'Pool' on the stack. Suggested by 'liontwist' and 'unwind'.
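|
| A sketch of that single-call variant (layout illustrative, not
| necessarily libpool's): with a C99 flexible array member, the
| header and the chunk array share one allocation, so closing the
| pool is a single free():

```c
#include <stdlib.h>
#include <stddef.h>

typedef union Chunk {
    union Chunk *next;
    char data[32];
} Chunk;

typedef struct Pool {
    Chunk *free_list;
    Chunk chunks[];     /* flexible array member: chunks follow header */
} Pool;

Pool *pool_new(size_t nchunks) {
    if (nchunks == 0) return NULL;
    /* One malloc covers both the header and the chunk array. */
    Pool *p = malloc(sizeof(Pool) + nchunks * sizeof(Chunk));
    if (!p) return NULL;
    for (size_t i = 0; i + 1 < nchunks; i++)
        p->chunks[i].next = &p->chunks[i + 1];
    p->chunks[nchunks - 1].next = NULL;
    p->free_list = p->chunks;
    return p;
}

void pool_close(Pool *p) { free(p); } /* one call frees everything */
```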
|
| - Mention (in a footnote) that we don't always need to store the
| full address when building the linked list of free chunks.
| Suggested by 'cb321'.
|
| - Added valgrind support (also to my 'libpool' project).
| Suggested by 'kazinator'.
| cb321 wrote:
| FWIW, while it is true that a smaller pointer lets you have
| smaller minimum fixed size objects in your pool, as your new
| footnote suggests, the space concern I was mostly alluding to
| in [1] is all the references in your _client code_ { such as
| the BST code I linked to where each binary search tree (BST)
| node has 2 pointers (or 3 if they have parent pointers to give
| BST nodes "doubly linked list-like autonomy") }.
|
| This can really add up. For example, like 20 years ago, I did a
| `print sizeof` in gdb for some g++ STL map node with a data
| payload of like 4 bytes and it was some obscene size
| approximately >= 1 order of magnitude bigger than necessary
| (like 80 bytes instead of, say, 8 with 2-byte ptrs). While main
| memory is cheap, CPU cache memory is not and for "linky"
| structures, it often pays to be more cache-friendly. It's
| pretty easy to imagine a 10X smaller thing fitting into an L2
| CPU cache where the larger thing would not even fit in L3,
| netting you a big latency boost (though the more root-ward
| levels of the tree likely always stay in caches in a hot loop).
|
| EDIT: Anyway, you don't have to believe me. Others did an x32
| ABI for x86_64 mostly to have "merely 2X" smaller pointers:
| https://en.wikipedia.org/wiki/X32_ABI
|
| [1] https://news.ycombinator.com/item?id=42643410
| veltas wrote:
| Also, the font isn't incredibly easy to read on Windows. I'm
| assuming Georgia was selected? It has been on Windows for years
| but renders so badly that the font color looks like it isn't
| black, even though I think it is. This isn't really your fault,
| but I would have stuck to Times New Roman for Windows' sake.
| tandr wrote:
| As a last code snippet readability suggestion - maybe extract
| "malloc" and "free" calls to your own simple wrappers (like
| my_malloc and my_free), and do "valgrinding" inside of them?
| You can also pass additional params to these functions to
| specify type of allocation if needed, or make separate
| functions.
| jhallenworld wrote:
| A sometimes useful thing is to treat the pool as a stack and
| provide a call to free all recently allocated items up to a
| certain index. So you make a bunch of deep subroutine calls that
| use the pool allocator, and then on return free them all. It's
| like a secondary automatic storage class.
|
| Also sometimes useful to provide another call to promote an
| automatic item to a more permanent one so that it is excluded
| from being freed like this.
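|
| A sketch of that mark/release pattern (names invented):

```c
#include <stddef.h>

typedef struct StackPool {
    char  *buf;
    size_t cap;
    size_t top;     /* bytes in use; grows like a stack */
} StackPool;

typedef size_t pool_mark;  /* an index into the pool */

pool_mark pool_save(StackPool *p) { return p->top; }

/* Frees everything allocated since the mark, in one step. */
void pool_restore(StackPool *p, pool_mark m) { p->top = m; }

void *pool_push(StackPool *p, size_t n) {
    n = (n + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if (p->top + n > p->cap) return NULL;
    void *r = p->buf + p->top;
    p->top += n;
    return r;
}
```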
| harrison_clarke wrote:
| you can avoid resizing and moving the pools if you allocate
| massive pools up front. most OSs let you overcommit memory, and
| most programs using memory pools only have a handful of them
|
| (on windows, you'd need to handle this with VirtualAlloc. osx and
| linux let you overcommit with malloc, and handle it with a copy-
| on-write page fault handler)
| lelandbatey wrote:
| It seems like almost all of the complexity of this allocator
| comes from having to manage the fact that it's being implemented
| _on top of_ the system allocator. I thought the whole point of
| using a custom allocator was to avoid the system allocator for
| exactly the problems they mention e.g. calling the system-
| provided realloc() might move your stuff around, having to track
| separate pool starts so you can call the system-provided free()
| on them, etc.
|
| Like, yeah you have to do that, but I thought the whole point of
| writing a custom allocator was to not use the system allocator
| beyond statically allocating a huge block of memory upfront and
| then managing that yourself, is it not?
| alecco wrote:
| (related) LLFree
|
| Paper: LLFree: Scalable and Optionally-Persistent Page-Frame
| Allocation https://www.usenix.org/system/files/atc23-wrenger.pdf
|
| Video presentation: https://www.youtube.com/watch?v=yvd3D5VOHc8
|
| Implementation and benchmarks are well documented at the repos:
|
| Rust repo https://github.com/luhsra/llfree-rs C repo
| https://github.com/luhsra/llfree-c
___________________________________________________________________
(page generated 2025-01-09 23:00 UTC)