[HN Gopher] Tracing garbage collection for arenas
       ___________________________________________________________________
        
       Tracing garbage collection for arenas
        
       Author : sirwhinesalot
       Score  : 56 points
       Date   : 2024-06-24 11:31 UTC (2 days ago)
        
 (HTM) web link (btmc.substack.com)
 (TXT) w3m dump (btmc.substack.com)
        
       | zozbot234 wrote:
       | Interesting discussion. If you want to experiment with
       | 'lightweight' traced GC for systems programming, you might like
       | https://github.com/chc4/samsara (for Rust) or
       | https://github.com/pebal/sgcl (for C++) both implementing a
       | concurrent tracing algorithm which is a bit more involved than
       | what's directly discussed in OP.
        
         | sirwhinesalot wrote:
         | Hi author here, I wasn't aware of those, I should probably link
         | to some existing library implementations of GCs.
         | 
         | My goal was really to make the simplest, dumbest thing possible
         | that still works. It can't compete with more serious GCs like
         | that, but hey, it's intentionally not meant to.
        
           | titzer wrote:
           | Virgil uses a simple stop-the-world semispace copying
           | collector. I did that because it's better to start moving
           | rather than nonmoving. I think the hotly contested questions
           | in systems programming are all complete red herrings. It's
           | not about GC versus non-GC, it's about gracefully handling
           | unsafe code, integrating JITed code, interfacing with the
           | lowest level APIs (e.g. kernel syscalls), implementing the
           | runtime system, etc. Data formats like network packets, file
           | formats, in-memory data structures, etc. That stuff is way
           | more important than if your goddamn lists of widgets are
           | heap, stack, or region allocated.
        
             | diffxx wrote:
             | I am confused by this comment. All the enumerated features
             | are dependent directly or indirectly on the memory
             | management approach that the system uses. If one chooses
             | badly for the target workload, there are many things that
             | will go wrong including excessive copying (which kills the
             | performance of the system) and inscrutable crashes/logic
             | errors as well as security vulnerabilities.
             | 
             | Personally, I tend to think that if you nail the memory
             | management technique all the other items will tend to work
             | themselves out. And I think this is actually harder to get
             | right _systemically_ than all of those other things.
        
               | titzer wrote:
               | > if you nail the memory management technique all the
               | other items will tend to work themselves out.
               | 
               | From my experience, implementing a language runtime for a
               | GC'd language in a non-GC'd language will lead you down
               | one of two well-trodded roads full of footguns and booby
               | traps, which includes either conservative-on-the-stack
               | scanning or a huge PITA handle system which has a bug
               | tail measured in decades. So you want to implement a GC'd
               | language in a GC'd language. But that's probably not what
               | you meant.
               | 
               | To the larger point, no, absolutely not. There are so
               | many other considerations for systems programming that
               | this whole "GC or not" debate is a waste of time. Read
               | the comment again. System programming deals with
               | constraints imposed by other hardware and software, as
               | well as the need to look at the implementation guts of
               | things, generate new code, etc. Solving the language's
               | memory management problems is only tangentially related,
               | meaning, if you do solve it, then you still have these
               | problems left over to solve.
        
               | zozbot234 wrote:
               | > I am confused by this comment. All the enumerated
               | features are dependent directly or indirectly on the
               | memory management approach that the system uses.
               | 
               | The key feature of low-level systems programming is that
               | there's no such thing as "the" one memory-management
               | approach. GC with arenas can be used in a "pluggable" way
               | for the limited case of objects referencing one another
               | in a fully general graph (including possible cycles)
               | while preserving other, more lightweight approaches
               | (including refcounting and simple RAII/static ownership
               | tracking) for the bulk of cases where the full generality
               | of GC isn't always required.
        
             | sirwhinesalot wrote:
             | Yup, 100% agreed. GC is a tool, we need to talk about
             | _specific_ GCs and their costs as to whether or not they
             | 're suitable for a particular task.
             | 
             | For me the main issue has to do with FFI and needing to
             | drag around a hefty runtime to make a library available to
             | other languages.
             | 
             | Not an issue with simpler GCs.
        
               | zozbot234 wrote:
               | FFI from GC'd to non-GC'd languages becomes workable if
               | you can manually add FFI-referenced objects as roots (so
               | that a collection cycle will not free them or anything
               | they reference in turn) and ensure that the GC can call
               | finalizers in order to ensure proper disposal (such as
               | decreasing reference counts or freeing memory) whenever a
               | GC object happens to control the lifecycle of non-GC'd
               | resources.
        
               | sirwhinesalot wrote:
               | I meant the other direction actually, non-GC'd calling
               | GC'd. FFI from GC'd to non-GC'd has its issues but is
               | thankfully much better explored.
               | 
               | With a minimal ref counting "GC" on GC'd side, you just
               | need 2 extra "runtime" functions (incref, decref), which
               | happen to fit like a glove with the scope-based resource
               | management you have in Rust/C++/Swift/C (the latter with
               | GCC extensions).
               | 
               | With a tracing GC, my hope was to prove that if you make
               | it as "minimal" as your usual ref counting
               | implementation, then it's also just a few functions (one
               | to init a GC heap, one to allocate on that heap, one to
               | collect that heap), which can hardly be considered a
               | "runtime".
               | 
               | Your typical tracing GC is a large, powerful and complex
               | beast that doesn't play nice with anything besides the
               | language/vm it was designed for.
        
               | neonsunset wrote:
               | You pretty much described how FFI works in .NET :)
               | 
               | When you are passing a pointer across FFI which points to
               | an object interior, you "pin" such object with `fixed`
               | statement which toggles a bit in the header of that
               | object. Conveniently, it is also used by GC during the
               | mark phase so your object is effectively pre-marked as
               | live. Objects pinned in such way cannot be moved,
               | ensuring the pointer to their interiors remains valid.
               | 
               | It's not as simple implementation-wise - .NET's GC has
               | multiple strategies to minimize the impact of pinned
               | objects on GC throughput and heap size. Additionally, for
               | objects that are expected to be long-lived and pinned
               | throughout their entire lifetime, there's a separate
               | Pinned Object Heap to further improve the
               | interoperability. In practical terms, this is not used
               | that often because most of the time you pass a struct or
               | a pointer to a struct on the stack that needs no
               | marshalling or pinning. In the rare case the pointer
               | needs to point to a long-lived pinned array, these are
               | allocated with `GC.AllocateArray<T>(length, isPinned)`.
               | 
               | .NET has another interesting feature that is important to
               | FFI: ByRefs which are special pointers that GC is aware
               | of that can point to arbitrary memory. You can receive an
               | int*, len from C and wrap it into a Span<int> (which
               | would be ref int + length), and GC will be able to
               | efficiently disambiguate that such pointer does not point
               | to GC-owned heap and can be safely ignored while should
               | it point to object memory, the memory range needs to be
               | marked and byrefs pointing to it need to be updated if it
               | is moved. That Span<int> can then be consumed by most
               | standard library APIs that offer span overloads next to
               | the array ones (there are more and more that accept just
               | spans, since many types with contiguous memory can be
               | treated as one).
               | 
               | This works the other way too and there is a separate
               | efficient mechanism for pinning byrefs when passing them
               | as plain pointers to C/C++/Rust/etc.
        
       | efitz wrote:
       | Despite the title "tracing garbage collection for arenas", the
       | article is not about analysis of sanitation practices at sports
       | facilities.
        
         | colechristensen wrote:
         | I'd totally watch a 5-30 minute documentary about waste
         | management in large venues. Bonus points if there were
         | questionable infographics about beer and hotdogs.
        
         | sirwhinesalot wrote:
         | This got a healthy chuckle out of me, well done :)
        
         | xcskier56 wrote:
         | I was definitely convinced that the article was somehow going
         | to pivot to this topic until it started getting into "arenas"
        
       | forrestthewoods wrote:
       | > Can we make "the simplest form" of tracing garbage collection
       | work for systems programming?
       | 
       | No. 24 bytes per allocation and having to perform a copy and
       | point fixup is too expensive.
       | 
       | The problem with GC is that it's fine until it's not. And when it
       | becomes not fine you need to radically change your entire
       | architecture. There is no incremental fix.
       | 
       | That said, Unreal Engine does in fact use a mark-and-sweep GC for
       | certain types of objects. Which could be something for the author
       | to look into.
        
         | sirwhinesalot wrote:
         | Are 16/24 bytes that much worse than 8/16 (with weak refs) for
         | ref counting?
         | 
         | The copying can be too expensive yes, though hopefully the
         | ability to use non-garbage collected arenas would mitigate that
         | issue (just don't put huge live data in the default, GCed
         | arena)
        
           | forrestthewoods wrote:
           | > Are 16/24 bytes that much worse than 8/16 (with weak refs)
           | for ref counting?
           | 
           | No, but that's not the right question. Most allocations in
           | languages like C/C++/Rust aren't using ref counts. In fact
           | excessive use of std::shared_ptr is a well known footgun in
           | C++!
           | 
           | > the ability to use non-garbage collected arenas would
           | mitigate that issue
           | 
           | I agree that not using GC is a way to mitigate GC issues! But
           | if you're only able to performantly use GC for a limited
           | number of allocations you also aren't getting much benefit
           | from GC.
        
             | sirwhinesalot wrote:
             | Note that you can allocate as much temp data as you want to
             | the GC heap without incurring any performance issues (in
             | fact allocation will be faster than malloc), because
             | copying collectors only ever care about live objects.
             | 
             | So the question is, how much of an issue it is to "contort"
             | the program to not allocate large, live objects to the GC
             | heap, vs contorting it to have a tree-like structure for
             | RAII-style memory management.
             | 
             | Honestly, I have no idea, one of the reasons I really want
             | to make a toy language to try it out and actually measure
             | the runtime costs vs convenience etc.
             | 
             | Note: For my particular C implementation, which is
             | admittedly cobbled together with spit and ducktape, you can
             | easily add support for individually managed allocations to
             | the GC heap, since it already uses buckets anyway.
             | 
             | If a bucket contains only 1 allocation, there's no need to
             | copy, you can just keep that bucket somewhere in the linked
             | list. This does require the programmer to explicitly
             | request that kind of allocation though.
             | 
             | For the "huge amount of small objects" case, GC and RAII/RC
             | are complete opposites. If most of those small objects die,
             | the GC will be a lot faster since it drops them by not
             | copying them. But if most of the objects are live, then the
             | GC will spend a brutal amount of time copying, while
             | RAII/RC will barely do any work.
             | 
             | Having memory management options is a good thing ;)
        
         | amelius wrote:
         | Manual garbage collection can also be expensive, in some cases
         | even more expensive than automatic garbage collection.
        
       | celeritascelery wrote:
       | Interesting write up. I am curious about the usage of this
       | framework. One thing in reference counting's favor is that it is
       | hard to screw up. You just use the rc type and when it is no
       | longer accessible it will magically get freed. That is how GC
       | works in a high level language too, you almost never need to
       | think about it.
       | 
       | Let's look at the example from the docs:
       | SWL_GCArena gc = swl_gc_arena_new(<starting-capacity>);
       | int* foo = swl_gc_alloc(&gc, 5000, alignof(int), NULL);
       | int* bar = swl_gc_alloc(&gc, 16000, alignof(int), NULL);
       | int* baz = swl_gc_alloc(&gc, 3000, alignof(int), NULL);
       | // foo and baz are roots         SWL_GC_COLLECT(&gc,
       | SWL_ROOT(foo), SWL_ROOT(baz));         // bar is gone, poof
       | 
       | If you try to access bar at the end you get a use after free. I
       | am pretty sure if you tried to access foo again it would trigger
       | the same problem, because it is a pointer directly to int and not
       | an indirection. What happens if you pass a non-arena pointer to
       | SWL_ROOT? What happens if you have multiple arenas and you mix
       | and match them?
       | 
       | There are many ways you could trivially shoot yourself in the
       | foot in large program. It is a lot less "idiot-proof" than simple
       | reference counting. The fact that you create each piece of memory
       | and then have to manually pass it back to the root macro on each
       | collection is very similar to the "malloc/free" dance we want to
       | try and avoid.
       | 
       | The rooting issue is something I tried to tackle in my GC for
       | rust[1]. Essentially making low level GC "idiot proof".
       | 
       | [1] https://coredumped.dev/2022/04/11/implementing-a-safe-
       | garbag...
        
         | sirwhinesalot wrote:
         | The idea is that this would be used in a language backend, so
         | the compiler would track the roots for you. Note that in C ref
         | counting is also rather error prone in that you can increment
         | or decrement a reference the wrong number of times.
         | 
         | But I agree this is not the best thing to do manually. The main
         | advantage is that you only need to call collect in 1 place in
         | your whole app. A lot easier to get right than having free all
         | over the place.
         | 
         | EDIT: placed your article on my reading list, thanks for
         | sharing it!
        
         | forrestthewoods wrote:
         | > One thing in reference counting's favor is that it is hard to
         | screw up.
         | 
         | Well, circular references will cause your RC objects to never
         | ever drop. So then you have to start using a bunch of Weak
         | references. And it can very quickly become a confusing and hard
         | to understand mess.
         | 
         | The cost of a memory leak is significantly lower than the cost
         | of a use after free. But disentangling the graph of refcounts
         | can be super painful.
        
       | jkcxn wrote:
       | How do you move memory and have all the pointers update to the
       | new position? I didn't understand that part.
        
         | sirwhinesalot wrote:
         | Hi, author here.
         | 
         | It's handled through the actual GC implementation, Cheney's
         | Algorithm to be specific, it's mentioned in the blog post but I
         | didn't write how it actually works there.
         | 
         | My git repo link at the end has an actual C implementation if
         | you'd like to have a look.
         | 
         | The TL;DR is that the algorithm starts by going through the
         | roots and copying each of them to a new Arena. Each time it
         | does it updates the root pointer to the new location and writes
         | out a "forwarding pointer" in the old allocation such that if 2
         | roots go to the same object, they can check if it's already
         | been forwarded, and replace themselves with that pointer
         | without copying.
         | 
         | Once all the roots are copied, the algorithm goes iteratively
         | through every allocation in the new Arena, digging through
         | their pointers, doing the same copy/update/forward/dance onto
         | the same Arena, until it reaches a fixpoint (no new copies
         | happen). Then it frees the old Arena.
         | 
         | It's a really simple algorithm, I strongly suggest having a
         | look at the wikipedia page :)
        
         | tekknolagi wrote:
         | The pointers that the programmer controls live on the shadow
         | stack and that's how they get updated. Another term for this is
         | a "GC handle"
        
       | moonchild wrote:
       | i would do a variable-size header 4 or 8 bytes for most small
       | objects (8 if you want to provide alignment, 4 if not or if you
       | can squish your pointers). one bit identifies if the header is
       | small or large; if small, next few bits give the size of the
       | object in words, and the remainder are a bitmap telling which
       | words are pointers. forwarding pointer stomps on the first word
       | of the object
        
       ___________________________________________________________________
       (page generated 2024-06-26 23:01 UTC)