# What is in a Rust Allocator?

Published on 2024-05-06. Tags: rust

Recently I was reading the documentation for jemalloc and wondered what actually goes into a custom memory allocator for Rust. To add jemalloc, or any other custom allocator, to a Rust application, one uses the `#[global_allocator]` attribute. Conveniently, the documentation for `GlobalAlloc` actually has a full example allocator for us to explore. It starts out as follows:

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::ptr::null_mut;
use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};

const ARENA_SIZE: usize = 128 * 1024;
const MAX_SUPPORTED_ALIGN: usize = 4096;

#[repr(C, align(4096))] // 4096 == MAX_SUPPORTED_ALIGN
struct SimpleAllocator {
    arena: UnsafeCell<[u8; ARENA_SIZE]>,
    remaining: AtomicUsize, // we allocate from the top, counting down
}
```

We begin by defining a constant arena size of 128 kB, which will be statically allocated at startup. Arenas are often faster because they make fewer memory allocation requests to the operating system and also reduce memory fragmentation, at the cost of memory overhead due to the unused space in the arena. In almost every real-world application, a memory allocator would use many smaller arenas instead of one large one, and would also be able to add arenas at runtime as the existing ones fill up.

Next we define the representation via `repr`. The representation describes the layout of this type in memory. The default representation is `Rust`, which means the Rust compiler can do whatever it wants for optimization purposes. `repr(C)` lays out the memory like C/C++ do, which is useful for interoperability with other languages. It is also somewhat more predictable, as those layouts do not change anymore. The Rust reference has a detailed breakdown of the different layout options.

In this case we also define a memory alignment, meaning the memory address of the allocator must be a multiple of 4096. Rust will also make sure that each of the fields is aligned in this way, adding padding between the fields if necessary. We do this because clients will later request memory with a certain alignment themselves, and aligning the overall array means any aligned offset inside the array will also be aligned automatically.

The struct itself has two fields. The arena is a fixed-size byte array, wrapped in an `UnsafeCell`. `UnsafeCell` is the base for all the regular, safe cells like `Cell` or `RefCell`, all of which enable interior mutability. This is required because `GlobalAlloc`'s `alloc` method takes a shared `&self` reference, not a mutable `&mut self` one, and returns a `*mut u8` pointer into the interior array. `UnsafeCell` gives us the required escape hatch to return a mutable pointer into a struct we only have a shared reference to.
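To make that escape hatch concrete, here is a minimal standalone sketch, not part of the allocator example; the `Buffer` type and `first_byte_ptr` function are mine. `UnsafeCell::get` hands out a `*mut` pointer even though we only hold a shared reference:

```rust
use std::cell::UnsafeCell;

struct Buffer {
    bytes: UnsafeCell<[u8; 4]>,
}

fn first_byte_ptr(buf: &Buffer) -> *mut u8 {
    // `UnsafeCell::get` takes `&self` and hands back `*mut [u8; 4]`,
    // which we can cast down to a pointer to the first byte.
    buf.bytes.get().cast::<u8>()
}

fn main() {
    let buf = Buffer { bytes: UnsafeCell::new([0; 4]) };
    let p = first_byte_ptr(&buf);
    // Writing through the pointer is unsafe; we are responsible for
    // making sure nothing else observes the mutation concurrently.
    unsafe { *p = 42 };
    assert_eq!(unsafe { *p }, 42);
}
```

Without the `UnsafeCell`, there would be no sound way to obtain a `*mut u8` into the array from `&Buffer`.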
The second field, `remaining`, is an offset from the start of the arena, effectively acting as a pointer into the arena. Because we are counting down "from the top," it starts out with the size of the arena, going down as memory is allocated. The benefit of this approach is that checking whether we have enough memory available to satisfy a request is as simple as checking whether `remaining` is at least as large as the requested amount of memory.[^1]

[^1]: A potential downside is that we cannot simply grow the most recent allocation without moving it, as would be possible if counting up from zero.

`remaining` is an `AtomicUsize`. Atomics are used to signal to both the compiler and the CPU that the ordering of memory accesses is important for correctness. If we were using a plain `usize`, the compiler or the CPU could decide that it would be faster to reorder reads or writes, compromising correctness. Rust inherits the C++ atomics memory model, which comes with a few ways of describing access restrictions, which we will dig into in a bit. The Rustonomicon also has a whole section on atomics with more details.[^2]

[^2]: For those curious, Memory Barriers: a Hardware View for Software Hackers has a low-level description of why and how ordering is done inside CPUs.

```rust
#[global_allocator]
static ALLOCATOR: SimpleAllocator = SimpleAllocator {
    arena: UnsafeCell::new([0x55; ARENA_SIZE]),
    remaining: AtomicUsize::new(ARENA_SIZE),
};
```

This is the bit that actually selects the globally used allocator, using the aforementioned `#[global_allocator]` attribute. It simply creates a static instance of the allocator, initializing the arena with `0x55`.[^3]

[^3]: To the best of my knowledge this value is selected because it results in alternating ones and zeroes, which might aid in debugging.

```rust
unsafe impl Sync for SimpleAllocator {}
```

We then implement the `Sync` marker trait for our allocator, which does not have any methods, but simply tells the compiler that it is safe to share references to the allocator across thread boundaries. This is an unsafe trait, and it requires us to make sure that the resulting behaviour is actually sound. We will see below how we accomplish this.

Now let us look at the implementation of `GlobalAlloc`, again an unsafe trait.

```rust
unsafe impl GlobalAlloc for SimpleAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let size = layout.size();
        let align = layout.align();
```

The first required method is `alloc`, which requests some memory from the allocator according to a `Layout` and expects a pointer to that memory in return. If there is no memory matching the request available, we return a null pointer. `Layout` has two methods that are interesting to us, `size` and `align`, which return the desired size and alignment respectively. The size is just a number of bytes, and the alignment works just the same as noted above when looking at `repr`.

```rust
        // `Layout` contract forbids making a `Layout` with align=0, or align not power of 2.
        // So we can safely use a mask to ensure alignment without worrying about UB.
        let align_mask_to_round_down = !(align - 1);
```

This comment mentions a contract, a requirement that is not captured outside of documentation: we can assume that the alignment is a non-zero power of two. This is important for the next step, where we set up a bitmask that we will use to ensure the correct alignment. Because the alignment is a power of two, it has exactly one bit set. By decrementing it by one, we transform e.g. 8 (`00001000`) into 7 (`00000111`), and then invert that to set all the high bits from the original one upwards, `11111000`. We will later `&` this onto our memory address, which acts as a round-down by clearing all the lower bits. Rounding down works because we made sure to align the overall array and are allocating from the top down, so rounding down introduces a gap "above" the newly allocated memory. If we were counting from the bottom up, we would be rounding up instead to introduce a gap "below."

```rust
        if align > MAX_SUPPORTED_ALIGN {
            return null_mut();
        }
```

At this point we check whether the client requested an alignment higher than we have chosen to support, in which case we return a null pointer to signal that we cannot fulfill the request. Again, this is dependent on the alignment of the arena array.
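Before moving on, here is a small self-contained sketch of the rounding mask in action; the `round_down` helper is mine, purely for illustration, and is not part of the example:

```rust
fn round_down(offset: usize, align: usize) -> usize {
    assert!(align.is_power_of_two()); // the `Layout` contract
    let align_mask_to_round_down = !(align - 1);
    offset & align_mask_to_round_down
}

fn main() {
    // align = 8 is 0b0000_1000; 8 - 1 is 0b0000_0111; !(8 - 1) is ...1111_1000.
    assert_eq!(round_down(13, 8), 8);
    assert_eq!(round_down(16, 8), 16); // already aligned, unchanged
    assert_eq!(round_down(4095, 4096), 0);
}
```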
```rust
        let mut allocated = 0;
        if self
            .remaining
            .fetch_update(SeqCst, SeqCst, |mut remaining| {
                if size > remaining {
                    return None;
                }
                remaining -= size;
                remaining &= align_mask_to_round_down;
                allocated = remaining;
                Some(remaining)
            })
            .is_err()
        {
            return null_mut();
        };
```

This is where the heavy lifting of allocation is done, and there is a lot to unpack here. Recall that `remaining` is an `AtomicUsize` describing an offset from the start of the arena, counting down as we allocate memory. Atomics rely on the underlying platform to maintain the correct order of operations, which is why they cannot be read or written like regular variables, but have a special set of methods to access them. In this case we are using the `fetch_update` method, which is a form of read-modify-write operation. As mentioned above, if we were not using an atomic, either the compiler or the CPU itself could decide to reorder the different reads and writes to the counter for performance reasons, compromising correctness.

`remaining` starts out pointing at the start of the previous allocation, if any, so we just subtract the size of the new allocation. Note that this is also where we use our bitmask to round down the offset to get the right alignment. The closure passed to `fetch_update` returns `Some(new_value)`, in this case `Some(remaining)`, while `fetch_update` itself returns a `Result`. Incidentally, the closure might also be called multiple times if there is contention from other threads. The mutable variable `allocated` in the outer scope allows us to extract the new value of `remaining` without re-fetching it and risking a data race.

`fetch_update` also takes two ordering arguments, which in this case are both `SeqCst`, short for sequentially consistent. These are atomic memory orderings inherited from C++'s memory model, and they specify the level of synchronization required when accessing this memory. There are a few different levels, of which this is the strictest, guaranteeing that no reads or writes are reordered such that they switch order with the access to `remaining`. This is important to avoid data races that could lead to overlapping allocations or over-committing memory that is not actually available.

```rust
        self.arena.get().cast::<u8>().add(allocated)
    }
```

We get a pointer to the start of the arena, add the offset of the newly allocated region, and return it as a pointer to that region. This is the reason for wrapping the arena in an `UnsafeCell`: it allows us to get a mutable pointer to its contents from a shared reference. We cannot use `std::ptr::addr_of_mut!` or simply coerce a reference to the right type, because we only have a shared reference to `self`, and thus cannot derive a mutable reference without interior mutability. We could use a `RefCell` and its `as_ptr` method though.[^4]

[^4]: I actually don't know why the author chose not to. The pull request adding this example has no discussion about this option.

```rust
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}
```

Releasing memory in this case is a no-op, which means that memory, once allocated, is gone forever, even after the contained values have been dropped. In a real-world application, we would probably want to keep track of memory that has become available again, so that we could reuse it for new allocations.
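For a taste of what that could look like while staying with the bump-allocator design, here is a hedged sketch of my own, not from the documentation example, of a `dealloc` that reclaims only the most recently handed-out block; anything else is still leaked:

```rust
    // Sketch only: reclaim the most recent allocation, a common compromise
    // for bump allocators. Alignment padding above the block is not
    // recovered, and any other pointer is still leaked.
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let remaining = self.remaining.load(SeqCst);
        let arena_start = self.arena.get().cast::<u8>();
        // The most recent allocation starts exactly at `arena_start + remaining`.
        if ptr == arena_start.add(remaining) {
            // Hand the bytes back. If another thread allocated in the
            // meantime, the compare_exchange fails and we simply leak.
            let _ = self.remaining.compare_exchange(
                remaining,
                remaining + layout.size(),
                SeqCst,
                SeqCst,
            );
        }
    }
```

This would slot into the `impl GlobalAlloc for SimpleAllocator` block in place of the no-op version; anything more general would need real bookkeeping, such as a free list.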
And that is all there is to it. Of course this is a very simple allocator, but it is actually functional. It does not dynamically request memory from the operating system, but that is a common pattern in embedded development, where there is no operating system. Still, in those scenarios we would probably want to reuse freed memory, which would require keeping track of which regions in the arena are currently in use and which are not, provided we can live with the memory overhead required to store that information.
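As a closing aside, here is a quick sketch, again my own rather than part of the documentation example, to watch the allocator at work, assuming this `main` lives in the same file as the code above:

```rust
fn main() {
    // Read the bump offset before and after a heap allocation. Every
    // allocation in this program is now served from the static arena.
    let before = ALLOCATOR.remaining.load(SeqCst);
    let boxed = Box::new([0u8; 256]);
    let after = ALLOCATOR.remaining.load(SeqCst);
    println!("used {} bytes of the arena", before - after);
    drop(boxed); // dealloc is a no-op, so the bytes are not returned
}
```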