[HN Gopher] Fast, simple, hard real time allocator for Rust
       ___________________________________________________________________
        
       Fast, simple, hard real time allocator for Rust
        
       Author : practicalrs
       Score  : 178 points
       Date   : 2024-05-01 07:39 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | lionkor wrote:
       | > Please note that offset-allocator isn't a Rust allocator
       | conforming to the GlobalAlloc trait
       | 
       | Aw. Why not?
       | 
       | > find the next available bin using 2x LZCNT instructions to make
       | all operations O(1)
       | 
        | Isn't that sort of implementation defined? The LZCNT operation
        | itself is a loop over all bits -- the fact that the number of
        | bits is constant doesn't make this O(1), it just makes it O(n)
        | where n := 64. But if it were 16 or 32 bits, it might be faster,
        | which means it's not O(1), or rather, big-O notation really
        | breaks down here.
        
         | exDM69 wrote:
         | > Aw. Why not?
         | 
          | Because the global allocator API is defined as free(pointer
          | address) and this uses free(buffer handle).
         | 
         | It would require a reverse lookup structure from address to
         | buffer handle, e.g. red-black tree. Maintaining it would no
         | longer be O(1).
         | 
         | > the fact that the number of bits is constant doesn't make
         | this O(1), it just makes it O(n) where n := 64
         | 
         | O(n) when n is constant is equal to O(1).
         | 
         | But lzcnt uses a fixed number of clock cycles so it's not
         | really O(n) either.
        
           | orlp wrote:
           | > It would require a reverse lookup structure from address to
           | buffer handle, e.g. red-black tree. Maintaining it would no
           | longer be O(1).
           | 
            | Well, another solution is enlarging each allocation by 8
            | bytes and storing the buffer handle in front of the returned
            | allocation. So `free(ptr)` would get the handle as `*(ptr
            | as *const u64).sub(1)`.
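            | 
            | A minimal sketch of that scheme in Rust (`raw_alloc` and
            | `raw_free` here are hypothetical stand-ins for the offset
            | allocator's own allocate/free, which deal in handles rather
            | than pointers):
            | 
            |       unsafe fn raw_alloc(size: usize) -> (*mut u8, u64) {
            |           unimplemented!() // stand-in: (address, handle)
            |       }
            |       unsafe fn raw_free(handle: u64) {
            |           unimplemented!() // stand-in
            |       }
            | 
            |       unsafe fn alloc(size: usize) -> *mut u8 {
            |           // Over-allocate by 8 bytes; stash the handle up
            |           // front and hand out the rest.
            |           let (base, handle) = raw_alloc(size + 8);
            |           (base as *mut u64).write_unaligned(handle);
            |           base.add(8)
            |       }
            | 
            |       unsafe fn free(ptr: *mut u8) {
            |           // Recover the handle stored just before the
            |           // pointer we handed out.
            |           let handle =
            |               (ptr as *const u64).sub(1).read_unaligned();
            |           raw_free(handle);
            |       }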
        
             | exDM69 wrote:
             | That works but...
             | 
              | This kind of allocator is usually used for suballocating
              | GPU buffers, so hiding a few bytes of metadata "in-band"
              | can mess up your alignment requirements, and not all kinds
              | of GPU memory are even accessible by the CPU. Due to
              | false-sharing cache problems you would probably want a
              | full cache line (64 bytes) to store the metadata.
             | 
             | For a CPU-only memory allocator your idea could work quite
             | well. It can also be implemented on top of this code
             | without any modifications.
        
           | pbalcer wrote:
           | > It would require a reverse lookup structure from address to
           | buffer handle, e.g. red-black tree. Maintaining it would no
           | longer be O(1).
           | 
           | Not necessarily. If you are able to map a block of normal
           | memory in a known location relative to the GPU memory that
           | you are managing with an "offset allocator", it should be
           | possible to directly calculate the metadata location for each
           | "offset". This is how most allocators find arenas/buckets
            | (whatever you want to call them) for an allocation given a
            | pointer.
           | 
           | Something like this:
           | +-------------------------+ 0x000000 (start of managed
           | memory)       | Metadata                |        |
           | |       |                         |       | ...
           | |       +-------------------------+ 0x001000        | Padding
           | ...             |       +-------------------------+ 0x010000
           | | GPU Memory Block        |       |                         |
           | |                         |       |                         |
           | ~2MB block       |                         |       |
           | |       |                         |
           | +-------------------------+ 0x210000
           | 
            | With this layout, to get to the metadata for an allocation,
            | all you need to do is align the allocation pointer down and
            | calculate the appropriate location in the metadata page.
           | 
           | This obviously won't work in all scenarios, but it's a simple
           | and practical way around a map lookup.
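            | 
            | A rough sketch of that lookup (the constants are made up to
            | match the diagram, and each managed region is assumed to be
            | mapped at a REGION_ALIGN-aligned base address):
            | 
            |       const REGION_ALIGN: usize = 0x40_0000;   // 4 MiB window
            |       const PAYLOAD_OFFSET: usize = 0x01_0000; // GPU block start
            |       const PAGE_SIZE: usize = 0x1000;
            | 
            |       // Region base and metadata slot for an allocation.
            |       fn metadata_slot(addr: usize) -> (usize, usize) {
            |           // Align down to the region start...
            |           let region = addr & !(REGION_ALIGN - 1);
            |           // ...then index the metadata table by page.
            |           let slot =
            |               (addr - (region + PAYLOAD_OFFSET)) / PAGE_SIZE;
            |           (region, slot)
            |       }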
        
             | exDM69 wrote:
             | A good idea but not O(1).
             | 
              | If you have a large allocation you have to mark every
              | "page" in the metadata table, which makes allocation
              | O(size in bytes / page size).
             | 
             | The overhead might still be practically acceptable, even if
             | it is not constant time.
        
           | jsheard wrote:
           | GlobalAlloc is also required to be thread safe, and TLSF
           | doesn't even attempt to handle that. I suppose you could get
           | away with it on single-threaded platforms like
           | microcontrollers or (vanilla) WebAssembly but it wouldn't
           | generalize unless you wrapped it in a mutex, and then the
           | performance would be terrible.
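            | 
            | For illustration, the mutex route might look like this (a
            | spin lock so it stays no_std-friendly; `RawAlloc` is a
            | made-up stand-in trait for the single-threaded allocator):
            | 
            |       use core::alloc::{GlobalAlloc, Layout};
            |       use core::cell::UnsafeCell;
            |       use core::sync::atomic::{AtomicBool, Ordering};
            | 
            |       trait RawAlloc {
            |           unsafe fn alloc(&mut self, l: Layout) -> *mut u8;
            |           unsafe fn dealloc(&mut self, p: *mut u8, l: Layout);
            |       }
            | 
            |       // Spin lock making the inner allocator Sync enough
            |       // to back GlobalAlloc.
            |       struct Locked<A> {
            |           busy: AtomicBool,
            |           inner: UnsafeCell<A>,
            |       }
            | 
            |       unsafe impl<A> Sync for Locked<A> {}
            | 
            |       unsafe impl<A: RawAlloc> GlobalAlloc for Locked<A> {
            |           unsafe fn alloc(&self, l: Layout) -> *mut u8 {
            |               while self.busy.swap(true, Ordering::Acquire) {}
            |               let p = (&mut *self.inner.get()).alloc(l);
            |               self.busy.store(false, Ordering::Release);
            |               p
            |           }
            |           unsafe fn dealloc(&self, p: *mut u8, l: Layout) {
            |               while self.busy.swap(true, Ordering::Acquire) {}
            |               (&mut *self.inner.get()).dealloc(p, l);
            |               self.busy.store(false, Ordering::Release);
            |           }
            |       }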
        
             | couchand wrote:
             | Any uc worth programming on has interrupts!
        
               | TrueDuality wrote:
               | Even with interrupts there is almost always a single
               | execution unit, and thus single-threaded. An interrupt is
               | just a jump instruction to a predefined handler. The code
               | that was running is entirely stopped until the system
               | decides to resume from that point.
        
               | couchand wrote:
               | That doesn't mean that naive single-threaded code is
               | necessarily interrupt-safe. If the allocator is
               | interrupted and the interrupt service routine needs to
               | make use of the allocator's data structures for any
               | reason there could very much be a conflict.
               | 
               | This particular algorithm is focused on GPUs, so I'm not
               | clear on the applicability. However I did want to clear
               | up the idea that there are no concurrency concerns for
               | single-threaded library code, an assumption that's been
               | the source of many bugs.
        
               | loeg wrote:
               | The presence of interrupts is basically equivalent to not
               | being single threaded, though. There are true single-
               | threaded environments, it's just limited to userspace
               | processes that don't handle signals.
        
             | exDM69 wrote:
              | Correct me if I'm wrong, but can't you put an
              | Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator
              | implementation for interior mutability?
              | 
              | This makes a non-thread-safe allocator, with the (desired)
              | side effect of making anything allocating from it (e.g.
              | Vec) !Send.
             | 
             | Or is there a requirement that Allocators must be Sync and
             | Send?
        
               | LegionMammal978 wrote:
               | This is about GlobalAlloc, the trait that actually exists
               | in stable Rust. Only static variables can be marked as
               | the #[global_allocator], and statics are always required
               | to be Send + Sync. Of course, if you're _really_ sure
               | that only a single thread will ever exist, you can pinky-
               | promise it by putting a couple unsafe impls on the
                | allocator type. But then the language won't protect you
               | if you do try to allocate or deallocate from multiple
               | threads.
               | 
               | (More practically, you can just write a GlobalAlloc that
               | forwards to an instance of a thread-local allocator.
               | Though that runs the risk of thread-local
               | deinitialization running in an unexpected order.)
        
           | Sharlin wrote:
           | OTOH, I think this would be a good fit for the "Store"
           | proposal [1] which uses handles rather than addresses to
           | refer to allocations.
           | 
           | [1]
           | https://github.com/matthieu-m/storage/blob/main/etc/rfc.md
        
         | orlp wrote:
         | > It just makes it O(n) where n := 64
         | 
         | That's O(1).
         | 
         | > The LZCNT operation itself is a loop over all bits
         | 
            | That's not how circuits work. You can easily make a circuit
            | of O(log n) depth that returns the base-2 encoded index of
            | the first 1 bit in an n-bit register.
         | 
         | But since n is just 64 here, you're looking at a circuit of
         | AND/OR depth ~8, so it's no surprise that modern CPUs can
         | compute the LZCNT / TZCNT of a 64-bit integer in a single CPU
         | cycle.
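          | 
          | For intuition, here's the same halving structure in software:
          | log2(64) = 6 compare-and-shift steps, one per level of such a
          | circuit (a sketch, not how the hardware literally does it):
          | 
          |       fn lzcnt64(mut x: u64) -> u32 {
          |           if x == 0 { return 64; }
          |           let mut n = 0;
          |           // Each step halves the remaining search window.
          |           if x <= 0x0000_0000_FFFF_FFFF { n += 32; x <<= 32; }
          |           if x <= 0x0000_FFFF_FFFF_FFFF { n += 16; x <<= 16; }
          |           if x <= 0x00FF_FFFF_FFFF_FFFF { n += 8;  x <<= 8;  }
          |           if x <= 0x0FFF_FFFF_FFFF_FFFF { n += 4;  x <<= 4;  }
          |           if x <= 0x3FFF_FFFF_FFFF_FFFF { n += 2;  x <<= 2;  }
          |           if x <= 0x7FFF_FFFF_FFFF_FFFF { n += 1; }
          |           n
          |       }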
        
           | sapiogram wrote:
           | Note that rustc will not emit this instruction by default, to
           | support older CPUs. Without `target-cpu=haswell` or similar,
           | the operation will be quite slow, but still O(1) of course.
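            | 
            | For example, to opt in (standard rustc codegen flag, set
            | here via Cargo):
            | 
            |       RUSTFLAGS="-C target-cpu=haswell" cargo build --release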
        
             | jsheard wrote:
              | It's not _that_ bad; all x86 processors have the BSR
              | instruction, which can be mapped to LZCNT semantics with
              | just a few extra instructions.
             | 
             | Left to right - modern x86-64, baseline x86-64, baseline
             | i686:
             | 
             | https://rust.godbolt.org/z/nGsM35TEo
             | 
             | Maybe you're thinking of POPCNT, which hurts a lot more if
             | it doesn't compile down to the native instruction:
             | 
             | https://rust.godbolt.org/z/xcxG3v4Mn
        
           | SkiFire13 wrote:
           | > That's O(1).
           | 
            | A loop over 2^64 elements is also O(1), but I don't think
            | people are OK with labeling every program that can run on a
            | 64-bit PC as O(1).
        
         | eru wrote:
         | Whether you call that O(1) or O(n) depends on the level of
         | abstraction you want to talk about, and what you want to
         | analyse.
         | 
         | For example, in some sense, anything you can do on a finite
         | computer is O(1) (or alternatively, it's an infinite loop). But
         | that is a very boring sense, and seldom useful for analysis.
         | 
         | There's a formal definition for big-O notation, but most of the
         | parameters to that formal definition are only implied in casual
         | discussions.
        
         | slashdev wrote:
          | On x86 this is an instruction that typically takes 3 clock
          | cycles (latency) with 1 clock cycle throughput. That's as
          | constant-time as you get.
         | 
         | Even if that were false, it's O(k) where k is the number of
         | bits, which is constant. However, in that case it might be
         | slower than the alternatives.
        
         | jameshart wrote:
         | Any big O discussion usually breaks down (or at least, changes)
         | if you start taking into account the data size of the operands.
         | 
         | You generally assume for big O purposes, when analyzing sorts
         | for example, that comparisons of elements are constant time,
         | and that swapping elements is constant time.
         | 
          | On an n-bit computer, when dealing with m-bit elements, those
          | assumptions are broadly sound. The time to compare two ints
          | doesn't depend on their values, and reading or writing an int
          | from memory doesn't take more or less time either.
          | 
          | But the same algorithm, while it has the same big-O
          | relationship to the size of a collection, might have a
          | different constant factor on different CPUs and for different
          | values of m. And you might find that, for some algorithms,
          | that constant factor's relationship to m has its own big O.
         | 
          | One common place this can bite you is when you try to apply
          | big-O analysis that has a "constant-time comparison"
          | assumption built in to, say, sorting arbitrarily long
          | strings, or inserting into a hash table using arbitrary
          | string keys. Comparing strings or calculating hashes of them
          | is not constant time.
        
           | empath-nirvana wrote:
           | In a practical sense, how you define big O depends on what
           | you consider to be the inputs to the function. If the runtime
           | doesn't change depending on the size of the inputs that you
           | care about, it's O(1).
           | 
            | Like, you might have a function with 2 inputs, n and m, and
            | the run time is like O(m * 2^n), but n is fixed at a low
            | number every time you'll run it in practice, so it's really
            | just O(m) for your purposes.
        
             | jameshart wrote:
              | Right. O(_f_(_n_)) is literally only defined for
              | situations where _n_ (1) varies between different runs of
              | the algorithm, and (2) can grow arbitrarily large. Even
              | though in practice "arbitrarily large" is always limited
              | by memory, storage, etc.
              | 
              | Talking about algorithms being O(_n_) in the number of
              | bits in a value is only reasonable if the number of bits
              | in the value actually varies between runs.
        
         | dzaima wrote:
          | Integer addition, and indeed pretty much every operation, is
          | also a loop over all bits; LZCNT is in no way unique here.
        
       | exDM69 wrote:
        | I wrote almost the exact same thing after seeing Sebastian
        | Aaltonen's original offset allocator repo, which inspired me to
        | write my own clone in Rust.
       | 
       | It's possible to further improve the fragmentation
       | characteristics if the maximum size of the memory to be allocated
       | is known ahead of time. The original used a 5.3 "floating point"
       | scheme for the free bins, but given the maximum size you can opt
       | for 4.4 or 6.2 or whatever can cover the whole region, giving
       | more dense (or sparse) allocation granularity as needed.
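        | 
        | Roughly, the bin rounding looks like this (a sketch of the idea
        | rather than the exact code; MANTISSA_BITS = 3 gives the 5.3
        | scheme, i.e. 8 bins per power of two):
        | 
        |       const MANTISSA_BITS: u32 = 3;
        |       const MANTISSA_VALUE: u32 = 1 << MANTISSA_BITS;
        | 
        |       // Round a size up to its bin and return the bin index;
        |       // this is one of the two LZCNTs (via leading_zeros).
        |       fn size_to_bin_round_up(size: u32) -> u32 {
        |           if size < MANTISSA_VALUE {
        |               return size; // tiny sizes get exact bins
        |           }
        |           let highest_bit = 31 - size.leading_zeros();
        |           let start = highest_bit - MANTISSA_BITS;
        |           let mut mantissa =
        |               (size >> start) & (MANTISSA_VALUE - 1);
        |           if size & ((1 << start) - 1) != 0 {
        |               mantissa += 1; // inexact: round up to next bin
        |           }
        |           ((start + 1) << MANTISSA_BITS) + mantissa
        |       }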
       | 
        | This same idea can also be extended to allocating texture atlas
        | regions when the regions have power-of-two sizes. 256 bins can
        | cover all possible texture sizes that GPUs can handle, so you
        | can get hard O(1) texture packing.
       | 
        | I'm also curious about making something like this work as a
        | general purpose allocator (as alluded to by the README). For
        | that it would be necessary to make a reverse lookup from pointer
        | address to allocation handle, which would require something like
        | a red-black tree covering the address space, which would no
        | longer be O(1). If anyone has ideas on this front, I would be
        | happy to hear them.
       | 
       | This algorithm is so clever and kind of a game changer.
       | Allocating and freeing are fast real time operations (only dozens
       | of CPU cycles and a few memory writes). It kind of makes "bump
       | allocators" obsolete because this allocator is almost as fast but
       | supports trimming and freeing allocations as well. This algorithm
       | requires more memory but the amount is quite low.
        
         | scott_s wrote:
         | > For that it would be necessary to make a reverse lookup from
         | pointer address to allocation handle, which would require
         | something like a red-black tree covering the address space,
         | which would no longer be 0(1). If anyone has ideas on this
         | front, I would be happy to hear them.
         | 
         | A radix tree can solve this:
         | https://en.wikipedia.org/wiki/Radix_tree
         | 
          | I used one way, way back, to do exactly the same thing: upon
          | a free, I needed to look up all of the metadata for an
          | address.
         | For a 32-bit address space, you can just allocate a giant array
         | up front, and use the address as an index. For a 64-bit address
         | space, it's obviously way too big to statically allocate it up
         | front. A radix tree neatly solves the problem. An arbitrarily
         | sized radix tree is not constant lookup, but for reasons I
         | honestly cannot remember, for a 64-bit address space, it's
         | guaranteed to only be 3 levels deep.
         | 
         | See my implementation, which (I believe) I borrowed from
         | tcmalloc:
         | https://github.com/scotts/streamflow/blob/master/streamflow....
        
           | ot wrote:
            | Radix trees are not O(1), they're O(log n). Data structures
            | that support constant-time predecessor lookup do not exist;
            | there is a super-constant lower bound, even in the RAM model
            | [1].
           | 
           | Often I hear people say "well pointer size is constant, so
           | log(64) = O(1)", but that is misleading: if you assume
           | constant pointer size, any function of that is O(1) as well,
           | so any bounded algorithm is O(1). The notation just becomes
           | meaningless.
           | 
           | Asymptotic bounds must be presented and understood in
           | context.
           | 
           | [1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathema
           | tic...
        
             | skybrian wrote:
             | Given that pointer size _is_ fixed, maybe asymptotic bounds
             | aren't what we should care about? Maybe it would be better
             | to ask how the cost increases just going up to 2^64 bytes
             | of address space. The graph after that point is irrelevant.
             | 
             | An O(n^2) algorithm will blow up well before that.
        
             | scott_s wrote:
             | This function is O(1): https://github.com/scotts/streamflow
             | /blob/master/streamflow..... All other tree operations are
             | also constant, because the fact that it is a 3-level tree
             | is hardcoded.
             | 
             | Asymptotic bounds are useful when your input can grow
             | arbitrarily large. When your input is fixed, the bounds
             | become less useful and you should focus on the particulars
             | of your case. In the case I'm presenting, the work done
             | traversing the tree will be the same every time; it is
             | O(1). That doesn't necessarily mean it's fast enough! It
             | still may do too much work for the use case. For instance,
             | I can imagine someone saying that the function above does
             | too much pointer chasing for their use case.
        
               | ot wrote:
               | > This function is O(1)
               | 
               | I think I addressed that in my comment, but to be more
               | explicit, this function is O(1) too:
                |       size_t find_sorted(void* object, void** list,
                |                          size_t size) {
                |           for (size_t i = 0; i < SIZE_MAX; ++i) {
                |               if (i < size && list[i] > object)
                |                   return i;
                |           }
                |           return SIZE_MAX;
                |       }
               | 
               | If O(1) cannot distinguish your function from this
               | function, what is its informational value?
               | 
               | > Asymptotic bounds are useful when your input can grow
               | arbitrarily large
               | 
               | But your inputs can't grow arbitrarily large, that's why
               | you can hardcode 3 levels. O(1) is an asymptotic bound,
               | and my point is that it is not very informative here.
        
               | fragmede wrote:
                | This is O(n) because you're still doing the _i < size_
                | comparison, even though you've moved it out of the for
                | loop's condition.
        
             | Dylan16807 wrote:
             | > Often I hear people say "well pointer size is constant,
             | so log(64) = O(1)", but that is misleading: if you assume
             | constant pointer size, any function of that is O(1) as
             | well, so any bounded algorithm is O(1). The notation just
             | becomes meaningless.
             | 
              | It's not _meaningless_. We're designing for a real
              | machine here, and the actual goal is to bound the runtime
              | to a small known constant.
             | 
             | You're right that big O is not great notation for this.
             | Hell, even O(1) isn't good enough because the constant
             | factor is unspecified.
             | 
             | But the misuse of notation does not invalidate the problem.
             | Taking the worst case for a log leaves you with a feasible
             | runtime. Taking the worst case for n or n^2 or "any bounded
             | algorithm" overwhelmingly does not.
             | 
             | > Asymptotic bounds must be presented and understood in
             | context.
             | 
             | Yes, context is exactly what keeps it meaningful!
        
         | yvt wrote:
         | I did this too! Except it was based on the original TLSF paper
         | (on which this work is based) as I wasn't aware of that offset
         | allocator.
         | 
         | > I'm also curious about making something like this work as a
         | general purpose allocator (as alluded to by the README).
         | 
          | In a general purpose allocator, you can just store allocation
          | metadata next to the payload so the lookup from a pointer
          | address becomes O(1). The original TLSF algorithm (as proposed
          | by the paper) was designed this way.
         | 
         | Months later, I also wrote a general purpose implementation of
         | TLSF for Rust `GlobalAlloc`, which, with some rigorous
         | optimizations, turned out to be faster and smaller than every
         | allocator for WebAssembly and embedded environments I tested,
          | and it's still my go-to choice. It didn't fare too badly even
         | when compared against modern thread-caching allocators in
         | normal use cases. Internal fragmentation is probably worse,
         | though.
        
           | scott_s wrote:
           | > In a general purpose allocator, you can just store
           | allocation metadata next to a payload so the lookup from a
           | pointer address becomes O(1).
           | 
           | Yes, but the downside is that now allocator metadata pollutes
           | the cache. It's super efficient for the allocator, but it may
           | harm efficiency of the actual use of that memory. I believe
           | most production allocators don't use object headers for this
           | reason.
        
             | Someone wrote:
             | > I believe most production allocators don't use object
             | headers for this reason.
             | 
             | Isn't the original reason hardening against buffer
             | overflows? Overwriting allocator metadata can be used to
             | attack a system.
        
               | pcwalton wrote:
               | Yes. I believe that the standard workaround for this
               | problem is to place the allocator metadata for an entire
               | block at some fixed address boundary, so you can just
               | mask off the bottom K bits of an address to find the
               | allocator metadata for the block, and from there find the
               | metadata of the allocation you're concerned with.
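                | 
                | i.e. something like this (K = 21, for 2 MiB-aligned
                | blocks, purely as an example):
                | 
                |       const K: u32 = 21; // assumed block alignment
                | 
                |       // The block's metadata header sits at the
                |       // masked-down address.
                |       fn block_metadata(addr: usize) -> usize {
                |           addr & !((1usize << K) - 1)
                |       }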
        
           | ComputerGuru wrote:
           | Have you published your code to GH, by any chance?
        
         | pcwalton wrote:
         | > I'm also curious about making something like this work as a
         | general purpose allocator (as alluded to by the README).
         | 
         | Besides the reverse mapping, you'd also have to get rid of the
         | Vec use in the allocator, but that's fairly straightforward.
         | Additionally, you'd need some sort of logic about requesting
         | slabs of memory from the OS (and returning them when free),
         | which is some extra work.
        
       | joshsyn wrote:
        | How exactly is this going to be used in Bevy?
        
         | CaptainOfCoit wrote:
         | At least one mention of it here
         | https://github.com/bevyengine/bevy/issues/12590
         | 
         | > Use one set of large mesh buffers per vertex attribute layout
         | 
         | > Problem: Index/vertex buffers have to be re-bound when the
         | mesh changes. This adds overhead when encoding draws and when
         | drawing. It also prevents some optimisations like being able to
         | draw all objects for shadow mapping for a light in one draw.
         | 
         | > Solution(s): [...] Use an appropriate allocator like a port
         | of Sebastian Aaltonen's offset allocator
         | [https://github.com/sebbbi/OffsetAllocator] to manage
         | allocation
         | 
         | Where the "port of Sebastian Aaltonen's offset allocator" is
         | what got linked in this HN submission.
        
         | pcwalton wrote:
         | The goal is to start storing multiple meshes in the same vertex
         | buffer/index buffer. (I already have a decent chunk of this
         | code written in a branch.) This reduces binding overhead, but
         | moreover it allows us to start using multidraw indirect where
         | supported, which should reduce drawcall counts dramatically. I
         | measured a 97% drawcall count reduction for the opaque pass on
         | the Bistro test scene, for example. The benefits are even
         | higher for shadow maps, because they use a single shader and so
         | the drawcall count should drop to the single digits in most
         | cases.
         | 
         | I've said it before, but if drawcall count is a problem in your
         | game, you should complain to the engine vendor. On modern GPUs,
         | there's no reason people should be manually optimizing for
         | drawcall count in 2024. Game engines have the tools to make it
         | a non-issue; they just need to use them.
        
       | widdershins wrote:
       | What's the difference between this and a slab allocator? Is it
       | just that the bin size distribution wastes less memory (assuming
       | the slab allocator used pow2 bin sizes)?
        
         | hinkley wrote:
          | Slab allocators don't provide real-time guarantees; that's
          | arena allocators. The distinction may seem trivial but the
          | requirements are distinct.
         | 
         | In tiny systems all allocations of one type may come from a
         | single operation, but in larger systems what fits in a slab
         | will come from distinct concerns with different priorities. You
         | want a different arena for those allocations.
        
       | mgrithm wrote:
        | What's an offset allocator?
        
         | jjtheblunt wrote:
         | https://github.com/sebbbi/OffsetAllocator
         | 
         | which has links to a pertinent paper at the end of that page.
        
           | hinkley wrote:
           | That doesn't explain anything.
           | 
           | And the linked paper doesn't even contain the word 'offset'.
        
       | Ono-Sendai wrote:
       | I have a best-fit allocator for this problem (managing GPU
       | resources). It has higher CPU overhead due to using a std::map,
       | but should waste less space due to not using fixed bucket sizes.
       | 
       | https://github.com/glaretechnologies/glare-core/blob/master/...
       | https://github.com/glaretechnologies/glare-core/blob/master/...
        
       | pcwalton wrote:
       | Author here. Didn't expect this to make HN :)
       | 
       | I don't deserve any credit for this; @SebAaltonen did all the
       | work to write the C++ version. I just did a line-by-line
       | translation.
        
       | hinkley wrote:
        | Why is this hard realtime in the face of arbitrary allocations
        | and deallocations? These dots aren't connected anywhere within
        | two links' distance of the given URL.
        
       ___________________________________________________________________
       (page generated 2024-05-02 23:01 UTC)