[HN Gopher] Fast, simple, hard real time allocator for Rust
___________________________________________________________________
Fast, simple, hard real time allocator for Rust
Author : practicalrs
Score : 178 points
Date   : 2024-05-01 07:39 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| lionkor wrote:
| > Please note that offset-allocator isn't a Rust allocator
| conforming to the GlobalAlloc trait
|
| Aw. Why not?
|
| > find the next available bin using 2x LZCNT instructions to make
| all operations O(1)
|
| Isn't that sort of implementation defined? The LZCNT operation
| itself is a loop over all bits -- the fact that the number of
| bits is constant doesn't make this O(1), it just makes it O(n)
| where n := 64. But if it were 16 or 32 bits, it might be faster,
| which means it's not O(1), or rather, big O notation really
| breaks down here.
| exDM69 wrote:
| > Aw. Why not?
|
| Because the global allocator API is defined as free(pointer
| address), while this uses free(buffer handle).
|
| It would require a reverse lookup structure from address to
| buffer handle, e.g. red-black tree. Maintaining it would no
| longer be O(1).
|
| > the fact that the number of bits is constant doesn't make
| this O(1), it just makes it O(n) where n := 64
|
| O(n) when n is constant is equal to O(1).
|
| But lzcnt uses a fixed number of clock cycles so it's not
| really O(n) either.
| orlp wrote:
| > It would require a reverse lookup structure from address to
| buffer handle, e.g. red-black tree. Maintaining it would no
| longer be O(1).
|
| Well, another solution is enlarging each allocation by 8
| bytes and storing the buffer handle in front of the returned
| allocation. So `free(ptr)` would get the handle as `*((ptr as
| *const u64) - 1)`.
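A minimal sketch of the header trick described above, assuming a made-up `Suballocator` whose inner `alloc_block`/`free_block` are a trivial bump-pointer stand-in for the real offset allocator's API: each allocation is enlarged by 8 bytes, the handle is written in front of the pointer handed out, and `free(ptr)` recovers it with one unaligned read instead of a reverse-lookup tree.

```rust
const HEADER: usize = 8; // room for a u64 handle in front of each allocation

struct Suballocator {
    buf: Vec<u8>,    // the managed region
    next: usize,     // bump offset (stand-in for the real allocator)
    freed: Vec<u64>, // freed handles, recorded for demonstration
}

impl Suballocator {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![0; capacity], next: 0, freed: Vec::new() }
    }

    // Stand-in for the real allocator: returns (offset, opaque handle).
    fn alloc_block(&mut self, size: usize) -> (usize, u64) {
        let off = self.next;
        self.next += size;
        (off, off as u64) // pretend the handle is just the offset
    }

    fn free_block(&mut self, handle: u64) {
        self.freed.push(handle); // a real impl would return the bin
    }

    fn alloc(&mut self, size: usize) -> *mut u8 {
        let (off, handle) = self.alloc_block(size + HEADER);
        assert!(off + size + HEADER <= self.buf.len());
        unsafe {
            let p = self.buf.as_mut_ptr().add(off);
            // Stash the handle in front of the returned pointer.
            (p as *mut u64).write_unaligned(handle);
            p.add(HEADER)
        }
    }

    unsafe fn free(&mut self, ptr: *mut u8) {
        // Read the handle stored just before the pointer we handed out.
        let handle = (ptr.sub(HEADER) as *const u64).read_unaligned();
        self.free_block(handle);
    }
}
```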
| exDM69 wrote:
| That works but...
|
| This kind of allocator is usually used for suballocating
| GPU buffers, so hiding a few bytes of metadata "in-band"
| can mess up your alignment requirements and not all kinds
| of GPU memory are even accessible by CPU. Due to false
| sharing cache problems you would probably want a full cache
| line (64 bytes) to store the metadata.
|
| For a CPU-only memory allocator your idea could work quite
| well. It can also be implemented on top of this code
| without any modifications.
| pbalcer wrote:
| > It would require a reverse lookup structure from address to
| buffer handle, e.g. red-black tree. Maintaining it would no
| longer be O(1).
|
| Not necessarily. If you are able to map a block of normal
| memory in a known location relative to the GPU memory that
| you are managing with an "offset allocator", it should be
| possible to directly calculate the metadata location for each
| "offset". This is how most allocators find arenas/buckets
| (whatever you want to call them) for an allocation by a
| pointer.
|
| Something like this:
| +-------------------------+ 0x000000 (start of managed memory)
| | Metadata                |
| | ...                     |
| +-------------------------+ 0x001000
| | Padding ...             |
| +-------------------------+ 0x010000
| | GPU Memory Block        |
| |                         |
| |      ~2MB block         |
| |                         |
| +-------------------------+ 0x210000
|
| With this layout, to get to the metadata for an allocation, all
| you need to do is align down the allocation pointer and
| calculate the appropriate location in the metadata page.
|
| This obviously won't work in all scenarios, but it's a simple
| and practical way around a map lookup.
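The align-down lookup in this layout can be sketched in a few lines (the block and page sizes here are illustrative, matching the diagram, not anything the allocator mandates): metadata sits at a known offset from each managed block, so finding an allocation's metadata slot is one mask plus one division by the metadata granularity -- no map lookup.

```rust
const BLOCK_SIZE: usize = 2 * 1024 * 1024; // one ~2MB managed block
const PAGE_SIZE: usize = 4096;             // granularity of one metadata slot

// Align an allocation address down to the base of its 2MB block.
fn block_base(addr: usize) -> usize {
    addr & !(BLOCK_SIZE - 1)
}

// Index of the metadata slot covering this address within its block,
// assuming one slot per PAGE_SIZE of managed memory.
fn metadata_slot(addr: usize) -> usize {
    (addr - block_base(addr)) / PAGE_SIZE
}
```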
| exDM69 wrote:
| A good idea but not O(1).
|
| If you have a large allocation you have to mark every
| "page" in the metadata table, which makes allocation
| O(size in bytes / page size).
|
| The overhead might still be practically acceptable, even if
| it is not constant time.
| jsheard wrote:
| GlobalAlloc is also required to be thread safe, and TLSF
| doesn't even attempt to handle that. I suppose you could get
| away with it on single-threaded platforms like
| microcontrollers or (vanilla) WebAssembly but it wouldn't
| generalize unless you wrapped it in a mutex, and then the
| performance would be terrible.
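For concreteness, here is what "wrap it in a mutex" looks like for `GlobalAlloc`, as a sketch: a spin lock around a non-thread-safe inner allocator satisfies the `Sync` requirement, with the contention cost described above. The `SpinLocked` type and the bump-allocator inner logic are illustrative stand-ins, not TLSF.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

const HEAP_SIZE: usize = 64 * 1024;

// A non-thread-safe bump allocator made to satisfy GlobalAlloc's
// Sync requirement with a spin lock.
struct SpinLocked {
    locked: AtomicBool,
    heap: UnsafeCell<[u8; HEAP_SIZE]>,
    next: UnsafeCell<usize>, // bump offset, guarded by `locked`
}

// Safety: all interior mutability is guarded by the spin lock.
unsafe impl Sync for SpinLocked {}

unsafe impl GlobalAlloc for SpinLocked {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        while self.locked.swap(true, Ordering::Acquire) {} // spin
        let base = (*self.heap.get()).as_mut_ptr() as usize;
        let next = &mut *self.next.get();
        // Align the absolute address up, then bump.
        let start = (base + *next + layout.align() - 1) & !(layout.align() - 1);
        let offset = start - base;
        let ptr = if offset + layout.size() <= HEAP_SIZE {
            *next = offset + layout.size();
            start as *mut u8
        } else {
            std::ptr::null_mut() // out of heap
        };
        self.locked.store(false, Ordering::Release);
        ptr
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // A bump allocator never reuses memory; a real allocator frees here.
    }
}
```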
| couchand wrote:
| Any uc worth programming on has interrupts!
| TrueDuality wrote:
| Even with interrupts there is almost always a single
| execution unit, and thus single-threaded. An interrupt is
| just a jump instruction to a predefined handler. The code
| that was running is entirely stopped until the system
| decides to resume from that point.
| couchand wrote:
| That doesn't mean that naive single-threaded code is
| necessarily interrupt-safe. If the allocator is
| interrupted and the interrupt service routine needs to
| make use of the allocator's data structures for any
| reason there could very much be a conflict.
|
| This particular algorithm is focused on GPUs, so I'm not
| clear on the applicability. However I did want to clear
| up the idea that there are no concurrency concerns for
| single-threaded library code, an assumption that's been
| the source of many bugs.
| loeg wrote:
| The presence of interrupts is basically equivalent to not
| being single-threaded, though. There are true single-
| threaded environments; they're just limited to userspace
| processes that don't handle signals.
| exDM69 wrote:
| Correct me if I'm wrong but can't you put a
| Rc<RefCell<OffsetAllocator>> in your std::alloc::Allocator
| implementation for interior mutability?
|
| This makes a non-thread-safe allocator, with the (desired)
| side effect of making anything allocating from it (e.g.
| Vec) a !Send.
|
| Or is there a requirement that Allocators must be Sync and
| Send?
| LegionMammal978 wrote:
| This is about GlobalAlloc, the trait that actually exists
| in stable Rust. Only static variables can be marked as
| the #[global_allocator], and statics are always required
| to be Send + Sync. Of course, if you're _really_ sure
| that only a single thread will ever exist, you can pinky-
| promise it by putting a couple unsafe impls on the
| allocator type. But then the language won't protect you
| if you do try to allocate or deallocate from multiple
| threads.
|
| (More practically, you can just write a GlobalAlloc that
| forwards to an instance of a thread-local allocator.
| Though that runs the risk of thread-local
| deinitialization running in an unexpected order.)
| Sharlin wrote:
| OTOH, I think this would be a good fit for the "Store"
| proposal [1] which uses handles rather than addresses to
| refer to allocations.
|
| [1]
| https://github.com/matthieu-m/storage/blob/main/etc/rfc.md
| orlp wrote:
| > It just makes it O(n) where n := 64
|
| That's O(1).
|
| > The LZCNT operation itself is a loop over all bits
|
| That's not how circuits work. You can easily make a circuit of
| O(log n) depth that returns base-2 encoded index of the first 1
| bit in a n-bit register.
|
| But since n is just 64 here, you're looking at a circuit of
| AND/OR depth ~8, so it's no surprise that modern CPUs can
| compute the LZCNT / TZCNT of a 64-bit integer in a single CPU
| cycle.
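The kind of bitmask bin scan being discussed can be sketched as follows (this is the general TLSF-style technique, not the crate's exact code): each set bit marks a non-empty free bin, and finding the first suitable bin at or above a minimum index is one mask plus one bit scan, which LLVM lowers to a single TZCNT/BSF (the same class of instruction as LZCNT) on x86-64.

```rust
// Find the lowest non-empty bin at index >= min_bin, if any.
fn first_free_bin(bin_mask: u32, min_bin: u32) -> Option<u32> {
    // Zero out bins below min_bin, then take the lowest remaining bit.
    // checked_shl avoids the shift-overflow panic when min_bin >= 32.
    let masked = bin_mask & u32::MAX.checked_shl(min_bin).unwrap_or(0);
    if masked == 0 {
        None // nothing suitable; a two-level scheme would move up a level
    } else {
        Some(masked.trailing_zeros()) // one TZCNT
    }
}
```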
| sapiogram wrote:
| Note that rustc will not emit this instruction by default, to
| support older CPUs. Without `target-cpu=haswell` or similar,
| the operation will be quite slow, but still O(1) of course.
| jsheard wrote:
| It's not _that_ bad, all x86 processors have the BSR
| instruction which can be mapped to LZCNT semantics with
| just a few extra instructions.
|
| Left to right - modern x86-64, baseline x86-64, baseline
| i686:
|
| https://rust.godbolt.org/z/nGsM35TEo
|
| Maybe you're thinking of POPCNT, which hurts a lot more if
| it doesn't compile down to the native instruction:
|
| https://rust.godbolt.org/z/xcxG3v4Mn
| SkiFire13 wrote:
| > That's O(1).
|
| A loop over 2^64 elements is also O(1), but I don't think
| people are OK with labeling every program that can run on a
| 64-bit PC as O(1).
| eru wrote:
| Whether you call that O(1) or O(n) depends on the level of
| abstraction you want to talk about, and what you want to
| analyse.
|
| For example, in some sense, anything you can do on a finite
| computer is O(1) (or alternatively, it's an infinite loop). But
| that is a very boring sense, and seldom useful for analysis.
|
| There's a formal definition for big-O notation, but most of the
| parameters to that formal definition are only implied in casual
| discussions.
| slashdev wrote:
| In all x86 this is an instruction that typically takes 3 clock
| cycles (latency) with 1 clock cycle throughput. That's as
| constant time as you get.
|
| Even if that were false, it's O(k) where k is the number of
| bits, which is constant. However, in that case it might be
| slower than the alternatives.
| jameshart wrote:
| Any big O discussion usually breaks down (or at least, changes)
| if you start taking into account the data size of the operands.
|
| You generally assume for big O purposes, when analyzing sorts
| for example, that comparisons of elements are constant time,
| and that swapping elements is constant time.
|
| On an n-bit computer, when dealing with m-bit elements, those
| assumptions are broadly sound. Comparing two ints doesn't
| depend on the ints. Reading or writing the int from memory
| doesn't take more or less time either.
|
| But the same algorithm, while it has the same big O
| relationship to the size of a collection, might have a
| different constant factor on different CPUs and for different
| values of m. And you might find that some algorithms that
| constant factor's relationship to m has its own big O.
|
| One common place this can bite you is when you try to apply big
| O analysis that has 'constant time comparison' assumption built
| in to, say, sorting arbitrarily long strings, or inserting into
| a hash table using arbitrary string keys. Comparing strings or
| calculating hashes of them is not constant time.
| empath-nirvana wrote:
| In a practical sense, how you define big O depends on what
| you consider to be the inputs to the function. If the runtime
| doesn't change depending on the size of the inputs that you
| care about, it's O(1).
|
| Like, you might have a function with 2 inputs, N and M, and
| the run time is like O(m * 2^n), but n is fixed at a low
| number every time you'll run it in practice, so it's really
| just O(m) for your purposes.
| jameshart wrote:
| Right. O( _f_ ( _n_ )) is literally only defined for
| situations where _n_ (1) varies between different runs of
| the algorithm, and (2) can grow arbitrarily large. Even
| though in practice 'arbitrarily large' is always limited by
| memory, storage, etc.
|
| Talking about algorithms being O( _n_ ) in the number of
| bits in a value is only reasonable if the number of bits in
| the value actually varies between runs.
| dzaima wrote:
| Integer addition, and indeed pretty much all operations, are
| also a loops over all bits, LZCNT is in no way at all unique
| here.
| exDM69 wrote:
| I wrote almost the exact same thing after seeing Sebastian
| Aaltonen's original offset allocator repo which inspired me to
| write my own clone in Rust.
|
| It's possible to further improve the fragmentation
| characteristics if the maximum size of the memory to be allocated
| is known ahead of time. The original used a 5.3 "floating point"
| scheme for the free bins, but given the maximum size you can opt
| for 4.4 or 6.2 or whatever can cover the whole region, giving
| more dense (or sparse) allocation granularity as needed.
|
| This same idea can also be extended to allocating texture atlas
| regions when the regions have power-of-two sizes. 256 bins can
| cover all possible texture sizes that GPUs can handle, so you can
| get hard O(1) texture packing.
|
| I'm also curious about making something like this work as a
| general purpose allocator (as alluded to by the README). For that
| it would be necessary to make a reverse lookup from pointer
| address to allocation handle, which would require something like
| a red-black tree covering the address space, which would no
| longer be O(1). If anyone has ideas on this front, I would be
| happy to hear them.
|
| This algorithm is so clever and kind of a game changer.
| Allocating and freeing are fast real time operations (only dozens
| of CPU cycles and a few memory writes). It kind of makes "bump
| allocators" obsolete because this allocator is almost as fast but
| supports trimming and freeing allocations as well. This algorithm
| requires more memory but the amount is quite low.
| scott_s wrote:
| > For that it would be necessary to make a reverse lookup from
| pointer address to allocation handle, which would require
| something like a red-black tree covering the address space,
| which would no longer be O(1). If anyone has ideas on this
| front, I would be happy to hear them.
|
| A radix tree can solve this:
| https://en.wikipedia.org/wiki/Radix_tree
|
| I used one way, way back to do exactly the same thing: upon a
| free, I needed to look up all of the metadata for an address.
| For a 32-bit address space, you can just allocate a giant array
| up front, and use the address as an index. For a 64-bit address
| space, it's obviously way too big to statically allocate it up
| front. A radix tree neatly solves the problem. An arbitrarily
| sized radix tree is not constant lookup, but for reasons I
| honestly cannot remember, for a 64-bit address space, it's
| guaranteed to only be 3 levels deep.
|
| See my implementation, which (I believe) I borrowed from
| tcmalloc:
| https://github.com/scotts/streamflow/blob/master/streamflow....
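A minimal fixed-depth radix map over 64-bit addresses, in the spirit of the tcmalloc-style three-level page map described above (all names and constants here are illustrative, not taken from streamflow): dropping the low 12 bits (4 KiB pages) leaves 52 significant bits, split 18/17/17, so every lookup walks exactly three nodes regardless of the address -- which is why the depth can be hardcoded.

```rust
const PAGE_SHIFT: u32 = 12;
const L1_BITS: u32 = 18;
const L2_BITS: u32 = 17;
const L3_BITS: u32 = 17; // 18 + 17 + 17 = 52 bits of page number

type Leaf = Vec<u64>;              // one metadata word per page (0 = unset)
type Mid = Vec<Option<Box<Leaf>>>; // second level, allocated lazily

struct RadixMap {
    root: Vec<Option<Box<Mid>>>,
}

impl RadixMap {
    fn new() -> Self {
        Self { root: (0..1usize << L1_BITS).map(|_| None).collect() }
    }

    // Split an address into the three per-level indices.
    fn indices(addr: u64) -> (usize, usize, usize) {
        let page = addr >> PAGE_SHIFT;
        let i3 = (page & ((1 << L3_BITS) - 1)) as usize;
        let i2 = ((page >> L3_BITS) & ((1 << L2_BITS) - 1)) as usize;
        let i1 = (page >> (L2_BITS + L3_BITS)) as usize;
        (i1, i2, i3)
    }

    fn set(&mut self, addr: u64, meta: u64) {
        let (i1, i2, i3) = Self::indices(addr);
        let mid = self.root[i1].get_or_insert_with(|| {
            Box::new((0..1usize << L2_BITS).map(|_| None).collect())
        });
        let leaf = mid[i2].get_or_insert_with(|| Box::new(vec![0; 1 << L3_BITS]));
        leaf[i3] = meta;
    }

    // Exactly three indexed loads, whatever the address.
    fn get(&self, addr: u64) -> Option<u64> {
        let (i1, i2, i3) = Self::indices(addr);
        let mid = self.root[i1].as_ref()?;
        let leaf = mid[i2].as_ref()?;
        Some(leaf[i3])
    }
}
```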
| ot wrote:
| Radix trees are not O(1), they're O(log n). Data structures
| that support constant-time predecessor lookup do not exist,
| there is a super-constant lower bound, even in the RAM model
| [1].
|
| Often I hear people say "well pointer size is constant, so
| log(64) = O(1)", but that is misleading: if you assume
| constant pointer size, any function of that is O(1) as well,
| so any bounded algorithm is O(1). The notation just becomes
| meaningless.
|
| Asymptotic bounds must be presented and understood in
| context.
|
| [1] https://en.wikipedia.org/wiki/Predecessor_problem#Mathema
| tic...
| skybrian wrote:
| Given that pointer size _is_ fixed, maybe asymptotic bounds
| aren't what we should care about? Maybe it would be better
| to ask how the cost increases just going up to 2^64 bytes
| of address space. The graph after that point is irrelevant.
|
| An O(n^2) algorithm will blow up well before that.
| scott_s wrote:
| This function is O(1): https://github.com/scotts/streamflow
| /blob/master/streamflow..... All other tree operations are
| also constant, because the fact that it is a 3-level tree
| is hardcoded.
|
| Asymptotic bounds are useful when your input can grow
| arbitrarily large. When your input is fixed, the bounds
| become less useful and you should focus on the particulars
| of your case. In the case I'm presenting, the work done
| traversing the tree will be the same every time; it is
| O(1). That doesn't necessarily mean it's fast enough! It
| still may do too much work for the use case. For instance,
| I can imagine someone saying that the function above does
| too much pointer chasing for their use case.
| ot wrote:
| > This function is O(1)
|
| I think I addressed that in my comment, but to be more
| explicit, this function is O(1) too:
| size_t find_sorted(void* object, void** list, size_t size) {
|     for (size_t i = 0; i < SIZE_MAX; ++i) {
|         if (i < size && list[i] > object) return i;
|     }
|     return SIZE_MAX;
| }
|
| If O(1) cannot distinguish your function from this
| function, what is its informational value?
|
| > Asymptotic bounds are useful when your input can grow
| arbitrarily large
|
| But your inputs can't grow arbitrarily large, that's why
| you can hardcode 3 levels. O(1) is an asymptotic bound,
| and my point is that it is not very informative here.
| fragmede wrote:
| This is O(n) because you're still doing the _i < size_
| comparison, even though you've moved it out of the for
| loop.
| Dylan16807 wrote:
| > Often I hear people say "well pointer size is constant,
| so log(64) = O(1)", but that is misleading: if you assume
| constant pointer size, any function of that is O(1) as
| well, so any bounded algorithm is O(1). The notation just
| becomes meaningless.
|
| It's not _meaningless_. We're designing for a real machine
| here, and the actual goal is to bound the runtime to a
| small known constant.
|
| You're right that big O is not great notation for this.
| Hell, even O(1) isn't good enough because the constant
| factor is unspecified.
|
| But the misuse of notation does not invalidate the problem.
| Taking the worst case for a log leaves you with a feasible
| runtime. Taking the worst case for n or n^2 or "any bounded
| algorithm" overwhelmingly does not.
|
| > Asymptotic bounds must be presented and understood in
| context.
|
| Yes, context is exactly what keeps it meaningful!
| yvt wrote:
| I did this too! Except it was based on the original TLSF paper
| (on which this work is based) as I wasn't aware of that offset
| allocator.
|
| > I'm also curious about making something like this work as a
| general purpose allocator (as alluded to by the README).
|
| In a general purpose allocator, you can just store allocation
| metadata next to a payload so the lookup from a pointer address
| becomes O(1). The original TLSF algorithm (as proposed by the
| paper) was designed this way.
|
| Months later, I also wrote a general purpose implementation of
| TLSF for Rust `GlobalAlloc`, which, with some rigorous
| optimizations, turned out to be faster and smaller than every
| allocator for WebAssembly and embedded environments I tested,
| and it's still my go-to choice. It didn't fare too badly even
| when compared against modern thread-caching allocators in
| normal use cases. Internal fragmentation is probably worse,
| though.
| scott_s wrote:
| > In a general purpose allocator, you can just store
| allocation metadata next to a payload so the lookup from a
| pointer address becomes O(1).
|
| Yes, but the downside is that now allocator metadata pollutes
| the cache. It's super efficient for the allocator, but it may
| harm efficiency of the actual use of that memory. I believe
| most production allocators don't use object headers for this
| reason.
| Someone wrote:
| > I believe most production allocators don't use object
| headers for this reason.
|
| Isn't the original reason hardening against buffer
| overflows? Overwriting allocator metadata can be used to
| attack a system.
| pcwalton wrote:
| Yes. I believe that the standard workaround for this
| problem is to place the allocator metadata for an entire
| block at some fixed address boundary, so you can just
| mask off the bottom K bits of an address to find the
| allocator metadata for the block, and from there find the
| metadata of the allocation you're concerned with.
| ComputerGuru wrote:
| Have you published your code to GH, by any chance?
| pcwalton wrote:
| > I'm also curious about making something like this work as a
| general purpose allocator (as alluded to by the README).
|
| Besides the reverse mapping, you'd also have to get rid of the
| Vec use in the allocator, but that's fairly straightforward.
| Additionally, you'd need some sort of logic about requesting
| slabs of memory from the OS (and returning them when free),
| which is some extra work.
| joshsyn wrote:
| How is this exactly going to be used in bevy?
| CaptainOfCoit wrote:
| At least one mention of it here
| https://github.com/bevyengine/bevy/issues/12590
|
| > Use one set of large mesh buffers per vertex attribute layout
|
| > Problem: Index/vertex buffers have to be re-bound when the
| mesh changes. This adds overhead when encoding draws and when
| drawing. It also prevents some optimisations like being able to
| draw all objects for shadow mapping for a light in one draw.
|
| > Solution(s): [...] Use an appropriate allocator like a port
| of Sebastian Aaltonen's offset allocator
| [https://github.com/sebbbi/OffsetAllocator] to manage
| allocation
|
| Where the "port of Sebastian Aaltonen's offset allocator" is
| what got linked in this HN submission.
| pcwalton wrote:
| The goal is to start storing multiple meshes in the same vertex
| buffer/index buffer. (I already have a decent chunk of this
| code written in a branch.) This reduces binding overhead, but
| moreover it allows us to start using multidraw indirect where
| supported, which should reduce drawcall counts dramatically. I
| measured a 97% drawcall count reduction for the opaque pass on
| the Bistro test scene, for example. The benefits are even
| higher for shadow maps, because they use a single shader and so
| the drawcall count should drop to the single digits in most
| cases.
|
| I've said it before, but if drawcall count is a problem in your
| game, you should complain to the engine vendor. On modern GPUs,
| there's no reason people should be manually optimizing for
| drawcall count in 2024. Game engines have the tools to make it
| a non-issue; they just need to use them.
| widdershins wrote:
| What's the difference between this and a slab allocator? Is it
| just that the bin size distribution wastes less memory (assuming
| the slab allocator used pow2 bin sizes)?
| hinkley wrote:
| Slab allocators don't provide real-time guarantees; that's
| what arena allocators are for. The distinction may seem
| trivial but the requirements are distinct.
|
| In tiny systems all allocations of one type may come from a
| single operation, but in larger systems what fits in a slab
| will come from distinct concerns with different priorities. You
| want a different arena for those allocations.
| mgrithm wrote:
| What's an offset allocator?
| jjtheblunt wrote:
| https://github.com/sebbbi/OffsetAllocator
|
| which has links to a pertinent paper at the end of that page.
| hinkley wrote:
| That doesn't explain anything.
|
| And the linked paper doesn't even contain the word 'offset'.
| Ono-Sendai wrote:
| I have a best-fit allocator for this problem (managing GPU
| resources). It has higher CPU overhead due to using a std::map,
| but should waste less space due to not using fixed bucket sizes.
|
| https://github.com/glaretechnologies/glare-core/blob/master/...
| https://github.com/glaretechnologies/glare-core/blob/master/...
| pcwalton wrote:
| Author here. Didn't expect this to make HN :)
|
| I don't deserve any credit for this; @SebAaltonen did all the
| work to write the C++ version. I just did a line-by-line
| translation.
| hinkley wrote:
| Why is this hard realtime in the face of arbitrary allocations
| and deallocations? These dots aren't connected anywhere within
| two links distance from the given URL.
___________________________________________________________________
(page generated 2024-05-02 23:01 UTC)