[HN Gopher] A Tour of Safe Tracing GC Designs in Rust
       ___________________________________________________________________
        
       A Tour of Safe Tracing GC Designs in Rust
        
       Author : belter
       Score  : 95 points
       Date   : 2021-06-19 10:01 UTC (13 hours ago)
        
 (HTM) web link (manishearth.github.io)
 (TXT) w3m dump (manishearth.github.io)
        
       | kgeist wrote:
       | >In Go, the compiler needs to automatically insert code that
       | checks the "pulse" of the heap every now and then, and
       | potentially runs garbage collection. It also needs to
       | automatically insert code that can tell the scheduler "hey now is
       | a safe time to interrupt me if a different goroutine wishes to
       | run"
       | 
       | Is it still the case after Go 1.14 (February 2020) when
       | asynchronous preemption was implemented? As far as I know,
       | goroutines are now preempted using signals instead of inserting
       | checks at every function call.
        
         | uluyol wrote:
         | The signal based mechanism is only used for stragglers.
         | 
         | Go still uses the function call check. The runtime devs found a
         | way to combine it with a check to see if the goroutine's stack
         | needs to be resized, so it doesn't actually cost anything.
        
       | pjmlp wrote:
       | Having something like this available as vocabulary types would be
       | quite nice for GUI libraries.
        
       | mjw1007 wrote:
       | Some history of the question of whether Rust should have GC as
       | standard:
       | 
       | 2014-09. RFC 256 https://rust-lang.github.io/rfcs/0256-remove-
       | refcounting-gc-...
       | 
       | << Remove the reference-counting based Gc<T> type from the
       | standard library and its associated support infrastructure from
       | rustc.
       | 
       | Doing so lays a cleaner foundation upon which to prototype a
       | proper tracing GC >>
       | 
       | Contemporary with that:
       | https://news.ycombinator.com/item?id=8312327 nikomatsakis:
       | 
       | << I wouldn't be so quick to give up on GC support yet! "We have
       | a plan!" But I don't think we'll have it in for Rust 1.0. >>
       | 
       | 2015-04:
       | 
       | https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.h...
       | 
       | "Memory safety without garbage collection." is now advertised as
       | a "key value proposition" (this isn't quite the same as saying
       | "we're not thinking of ever adding garbage collection", of
       | course).
       | 
       | 2016-08: https://manishearth.github.io/blog/2016/08/18/gc-
       | support-in-...
       | 
       | << Recently we (Felix, Niko, and I) have been working on getting
       | compiler-level GC support for Rust. The plan is to provide a base
       | set of APIs and intrinsics on which GCs can be built, without
       | including an actual GC itself. >>
       | 
       | This is the "Unfinished design for a builtin Rust GC" described
       | in TFA.
       | 
       | 2018-10:
       | 
       | the introductory post for the shifgrethor library (described in
       | TFA): https://boats.gitlab.io/blog/post/shifgrethor-i/
       | 
       | includes a bolded << I do not expect we will ever "add GC to
       | Rust" >>.
        
         | dgb23 wrote:
          | It's amazing how openly and organically this evolved.
        
       | amelius wrote:
       | Wouldn't it make more sense to have garbage collectors at the
       | level of LLVM?
       | 
       | That way, you could have direct access to all the pointers in a
       | program without jumping through the language's hoops. And other
       | languages could benefit from it too.
        
         | joppy wrote:
         | Yes, it makes total sense for the garbage collector to be
         | tightly coupled to the memory allocation system, which is also
         | tightly coupled to code generation (as is done in the case of
         | various high-performance managed languages: Java, C#, Haskell,
         | ...). I remember hearing a talk a few years ago about a system
         | of the kind you're talking about - a kind of "managed language"
         | micro-core that ran alongside LLVM so that it could hook into
         | code generation, and was intended to be a kind of toolkit for
         | making new managed languages. Can't remember the name...
        
           | an_opabinia wrote:
           | The performance is still too bad for anything but Java on
           | real applications. The other killer feature, ahead of time
           | compilation, is really clunky, and its most innovative
           | change, static initialization at build time by default, was
           | reverted to runtime initialization.
        
           | duped wrote:
           | That sounds a lot like Graal
        
         | saurik wrote:
         | The concept of what a pointer "is" is somewhat language-
         | dependent, particularly as people finally attempt to mitigate
         | the waste of moving to 64-bit by using tagged pointers (which
         | maybe could try to be modelled in LLVM also, but I think you
         | will rapidly realize that LLVM just isn't modeling a lot of
         | things that affect scanning).
        
           | tialaramex wrote:
           | > particularly as people finally attempt to mitigate the
           | waste of moving to 64-bit by using tagged pointers
           | 
            | It's very fraught to try to "mitigate the waste" at the top
            | of the pointer. Everybody involved is clear that what you're
            | doing there is saving up pain for yourself: periodically
            | more of those "wasted" bits become significant, and each
            | time that happens, if you've been using them, you've
            | incurred a maintenance burden.
           | 
            | The waste at the _bottom_ of the pointer is a far more
            | sensible place to hide information because it won't get
            | taken away when somebody ships a new CPU; it was already
            | there in 32-bit architectures.
           | 
           | LLVM itself makes use of the latter trick extensively as I
            | understand it, but doesn't use the former. If they've got
            | 64-byte structures, they can align them, then steal six bits
           | at the bottom of each pointer to those structures for bit
           | flags, and when Intel ships a new CPU their code still works.
           | Whereas if I steal ten bits at the top of the pointer, when
           | Intel ships Ice Lake CPUs my compiler now crashes. Good luck
           | debugging that!
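            | 
            | To make the bottom-bits trick concrete, here's a minimal
            | Rust sketch (the Block type and the flag layout are just
            | illustrative):
            | 
            |     #[repr(align(64))]
            |     struct Block { data: [u8; 64] }
            |     
            |     // 64-byte alignment guarantees the low 6 bits of any
            |     // &Block are zero, so they can carry flags portably.
            |     const FLAG_MASK: usize = 0b11_1111;
            |     
            |     fn tag(ptr: *const Block, flags: usize) -> usize {
            |         debug_assert_eq!(ptr as usize & FLAG_MASK, 0);
            |         ptr as usize | (flags & FLAG_MASK)
            |     }
            |     
            |     fn untag(word: usize) -> (*const Block, usize) {
            |         ((word & !FLAG_MASK) as *const Block,
            |          word & FLAG_MASK)
            |     }
            |     
            |     fn main() {
            |         let b = Block { data: [0; 64] };
            |         let word = tag(&b, 0b101);
            |         let (p, flags) = untag(word);
            |         assert_eq!(p, &b as *const Block);
            |         assert_eq!(flags, 0b101);
            |         // No assumption about the top of the address
            |         // space, so this keeps working when CPUs grow
            |         // the virtual address size.
            |     }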
        
             | zozbot234 wrote:
             | > Whereas if I steal ten bits at the top of the pointer,
             | when Intel ships Ice Lake CPUs my compiler now crashes.
             | Good luck debugging that!
             | 
             | That's not how it works. The x86-64 ISA requires these
             | "top" bits to be properly sign extended when actually
             | accessing addresses, so any binary code that's using this
             | "trick" is forward compatible wrt. future CPU's that might
             | enable a bigger virtual address space. The only concern
             | wrt. compatibility is purely ABI related but that can be
             | managed at the OS and system-software layer.
        
               | tialaramex wrote:
               | > That's not how it works. The x86-64 ISA requires these
               | "top" bits to be properly sign extended when actually
               | accessing addresses.
               | 
               | On Coffee Lake no more than 48 bits of the virtual
               | address are "real". If I steal ten of the 64 bits from
               | the top, all the actual address information is preserved
               | and I've lost nothing. My compiler works just fine.
               | 
               | On Ice Lake up to 57 bits of the virtual address are
               | "real". Since I'm stealing ten of the 64 bits from the
               | top, I only have 54 bits left, so 3 significant bits were
               | dropped and I don't have them any more. Sign-extending
               | the wrong value won't magically make it into the right
               | value.
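                | 
                | A tiny Rust sketch of that bit loss (addresses made
                | up; pack puts a 10-bit tag up top, unpack recovers
                | the address by sign-extending from bit 53):
                | 
                |     fn pack(addr: usize, tag: usize) -> usize {
                |         (tag << 54) | (addr & ((1usize << 54) - 1))
                |     }
                |     
                |     fn unpack(tagged: usize) -> usize {
                |         // Drop the tag, sign-extend from bit 53.
                |         (((tagged << 10) as isize) >> 10) as usize
                |     }
                |     
                |     fn main() {
                |         let a48 = 0x0000_7fff_ffff_f000_usize;
                |         // Fits in 48 bits: round-trips fine.
                |         assert_eq!(unpack(pack(a48, 0x2a)), a48);
                |     
                |         let a57 = 0x01ff_ffff_ffff_f000_usize;
                |         // Needs 57 bits: the top 3 bits are lost.
                |         assert_ne!(unpack(pack(a57, 0x2a)), a57);
                |     }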
        
               | klodolph wrote:
               | Yes--and my concern is that old binaries will suddenly
               | stop working on newer hardware. This has happened before!
               | 
               | The whole reason the top bits are sign-extended in the
               | first place is to discourage these kinds of shenanigans.
                | I remember the pain of switching from 24-bit to 32-bit
                | addressing on M68K Macs.
        
               | celeritascelery wrote:
               | Until the virtual address space gets so big that there
               | are no bits left at the top that are "free". As soon as
                | you have a full 64-bit virtual address space, pointer
                | tagging
               | the top bits no longer works. Your program is now broken
               | on the new CPU.
        
               | devit wrote:
               | Of course competent programmers using the top K bits of
               | the pointer would make sure that the kernel and memory
               | allocator operate so that they don't allocate memory
               | outside a (64 - K)-bit address space, regardless of what
               | the CPU supports (e.g. Linux mmap as one would expect
               | won't allocate beyond 2^48 unless explicitly requested).
               | 
               | And obviously any intelligent CPU architect, if they make
               | a CPU that masks the upper bits, they would make the
               | number of bits that gets masked configurable per-process
               | and not dependent on the maximum virtual address space
               | supported by the CPU.
        
               | celeritascelery wrote:
               | > Of course competent programmers ... And obviously any
               | intelligent CPU architect
               | 
                | Interesting to see the future is so "obvious" to you.
                | You are correct, however, that this is not a problem at
                | present because no architecture uses more than 48 bits
                | of address space in practice. And since this is such a
                | common practice now, allowances will have to be made in
                | the future when the address space is expanded.
        
               | vlovich123 wrote:
               | As a sibling to you noted, Ice Lake has bumped this from
               | 48 bits to 57 bits to support 5-level paging [1]. So the
               | "in practice" appears to already by out of date. This was
               | known publicly as early as 2017 (albeit this is the first
               | I've heard of it).
               | 
               | That being said, I don't know why the increase from 256TB
               | to 128PB is at all useful but it's enabled in my Arch
               | Linux kernel on a 32GB laptop, so clearly adopted by the
               | broader ecosystem even outside whatever niche use-case
               | it's for.
               | 
                | That said, the kernel maintains backward
                | compatibility by restricting the allocation address space
               | it hands out by default. Userspace needs to explicitly
               | ask for the higher bits [2], so it's probably OK to just
               | continue as-is assuming the 48-bit limit in practice.
               | 
               | [1] https://en.wikipedia.org/wiki/Intel_5-level_paging
               | 
               | [2]
               | https://www.kernel.org/doc/html/latest/x86/x86_64/5level-
               | pag....
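                | 
                | For the curious, a hedged Rust sketch of that opt-in on
                | Linux/x86-64 (assuming the libc crate; per [2], a hint
                | address above the default user-space boundary is what
                | unlocks the larger space):
                | 
                |     use std::ptr;
                |     
                |     fn anon_map(hint: *mut libc::c_void)
                |                 -> *mut libc::c_void {
                |         unsafe {
                |             libc::mmap(hint, 4096,
                |                 libc::PROT_READ | libc::PROT_WRITE,
                |                 libc::MAP_PRIVATE
                |                     | libc::MAP_ANONYMOUS,
                |                 -1, 0)
                |         }
                |     }
                |     
                |     fn main() {
                |         // Default: no hint, the kernel stays within
                |         // the 48-bit space.
                |         let low = anon_map(ptr::null_mut());
                |         // Opt in: a high hint may yield a 57-bit
                |         // (LA57) address on 5-level-paging kernels.
                |         let high = anon_map(
                |             (1usize << 48) as *mut libc::c_void);
                |     
                |         println!("low  = {:p}", low);
                |         println!("high = {:p}", high);
                |     }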
        
               | zozbot234 wrote:
               | > And obviously any intelligent CPU architect, if they
               | make a CPU that masks the upper bits, they would make the
               | number of bits that gets masked configurable per-process
               | 
               | x86-64 does this by requiring all accesses to be
               | correctly sign-extended, which means that valid virtual
               | addresses are unique and there is full forward
               | compatibility even with an extended virtual address
                | space. The "number of bits that get masked" is a pure
                | system/ABI concern; the CPU support only defines a
                | maximum.
        
             | saurik wrote:
              | The longer I stare at your comment, the stronger the
              | feeling that you are hung up on a semantic issue. While
              | the intermediate states I went through were maybe a bit
              | more interestingly explanatory with respect to specific
              | techniques, the final state I landed on is simply: "if
              | you only used the bottom single bit to tag that the
              | entire value is a value instead of a pointer, that alone
              | mitigates the waste of moving to 64 bits by having fewer
              | indirect allocations".
        
               | tialaramex wrote:
               | The 32-bit platform can do this same trick, so, your
               | 64-bit pointers are still wider than theirs, and you
               | didn't gain anything _unless_ you also steal high bits.
               | 
                | This isn't quite entirely true: for them, address space
                | is very tight, and so they daren't just align everything
                | and risk wasting too much address space, whereas on a
                | 64-bit platform you can afford to be very generous with
                | address space. This is akin to how it's fine to have
                | randomised privacy addressing in IPv6, but that'd be a
                | problem in IPv4. But this gain is hard to give a
                | specific value to, whereas the simple fact is that
                | 64-bit pointers are exactly twice as big as 32-bit
                | pointers.
               | 
                | This can give you an incentive to build a different
                | data structure. It's surprising how often "just use a
                | vector" is correct these days; Chandler Carruth has
                | talked about that at length (in C++). If you have at
                | least some idea how many Foozles you need, and Foozles
                | aren't some crazy multi-page data structure, chances
                | are you should put all the Foozles in a vector, and not
                | carry around actual pointers to Foozles at all.
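                | 
                | A minimal Rust sketch of that idea (Foozle is made
                | up): indices into a vector instead of pointers, at
                | half the size or less per reference:
                | 
                |     struct Foozle { weight: u32 }
                |     
                |     struct World {
                |         // One contiguous, cache-friendly block.
                |         foozles: Vec<Foozle>,
                |         // A 4-byte "reference", not a pointer.
                |         heaviest: u32,
                |     }
                |     
                |     fn main() {
                |         let world = World {
                |             foozles: vec![Foozle { weight: 10 },
                |                           Foozle { weight: 99 }],
                |             heaviest: 1,
                |         };
                |         let f = &world.foozles
                |                     [world.heaviest as usize];
                |         assert_eq!(f.weight, 99);
                |     }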
        
             | xfer wrote:
              | Isn't using the bottom bit how most runtimes represent
              | their ints, taking advantage of the fact that GC pointers
              | are aligned? This is common practice afaiu.
        
               | tialaramex wrote:
               | Sure, but for the bottom bit, 64-bit addressing doesn't
               | come into it. If you only want to steal bits at the
               | bottom and have pointers with greater than single byte
               | alignment this worked just fine on a 32-bit CPU already.
               | 
               | I was reacting to the claim that this mitigates the
               | overhead (larger pointer size) of 64-bit addressing. You
                | _can_ steal bits at the top but it's probably a bad
                | idea.
        
         | boredprograming wrote:
          | LLVM has built-in hooks for GC, used by Azul's Zing JVM JIT
          | compiler and perhaps Safari's LLVM JS JIT.
          | 
          | I'm not sure if the GC integration is tied to JIT support or
          | not. I think it's mostly related to the insertion of
          | safepoints, which could be useful for a Rust implementation.
        
       | celeritascelery wrote:
       | For me, the most interesting one is Josephine[1]. This allows
       | tracking of stack roots, doesn't limit the time when GC can run,
        | and takes advantage of Rust's affine type system. However, I
        | have
       | to say the API feels fairly clunky, with the need to root
       | everything before mutation and the need to pass context objects
       | to every method.
       | 
       | [1] https://github.com/asajeffrey/josephine
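        | 
        | To give a flavor of that clunkiness, a hypothetical sketch
        | (the names Heap, Ctx, and Rooted are illustrative, not
        | Josephine's actual API):
        | 
        |     struct Heap { cells: Vec<u64> }
        |     struct Ctx<'h> { heap: &'h mut Heap }
        |     
        |     // A value is reachable only through a root whose
        |     // lifetime the type system tracks, so collection
        |     // can't invalidate it out from under you.
        |     struct Rooted<'root> { index: usize, _s: &'root () }
        |     
        |     fn alloc<'root>(ctx: &mut Ctx<'_>, scope: &'root (),
        |                     v: u64) -> Rooted<'root> {
        |         ctx.heap.cells.push(v);
        |         Rooted { index: ctx.heap.cells.len() - 1,
        |                  _s: scope }
        |     }
        |     
        |     fn get(ctx: &Ctx<'_>, r: &Rooted<'_>) -> u64 {
        |         ctx.heap.cells[r.index]
        |     }
        |     
        |     fn main() {
        |         let mut heap = Heap { cells: Vec::new() };
        |         let mut ctx = Ctx { heap: &mut heap };
        |         let scope = (); // the root scope for this region
        |         // The context has to be threaded into every call.
        |         let x = alloc(&mut ctx, &scope, 42);
        |         assert_eq!(get(&ctx, &x), 42);
        |     }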
        
       | zozbot234 wrote:
       | The OP does not seem to mention CactusRef
       | https://crates.io/crates/cactusref which provides deterministic
       | deallocation (on par with simple refcounting) of reference
       | cycles, with minimal tracing requirements and minimal memory
       | overhead.
        
         | pcwalton wrote:
         | It requires unsafe.
        
           | zozbot234 wrote:
           | It _used to_ require unsafe. The project has been reworked to
           | add safer alternatives (at the cost of some increased
           | overhead).
        
         | sizediterable wrote:
         | Probably because cactusref was only published 5 days ago and
         | this article is from three months ago...
        
       | boredprograming wrote:
        | One day, Rust will need a GC. Reference counting is just a
        | crappy GC. Modern GCs can perform better than this, so Rust is
        | actually hurting its own performance by not having one.
        | 
        | A good GC would make heavily concurrent apps much easier to
        | build with Rust. And it would have better performance than the
        | typical Arc<Mutex<T>> objects passed around right now.
        
         | mhh__ wrote:
          | I would need some data on that, _but_ I have to say that it
          | always makes me laugh when people _only_ talk about the GC in
          | threads about D. It's so good for productivity. I don't
          | really like it, but I can't describe just how much of a
          | non-issue it is for us (the company I work for).
        
         | _flux wrote:
          | Tracing GC is troublesome for any non-memory resource, such
          | as a network connection or file handle, due to its untimely
          | release, but otherwise I actually agree: reference counting
          | is a GC mechanism--not a very good one, but it's the only one
          | I'm aware of that works both for memory and resources.
         | 
          | I would enjoy seeing someone test a model where the type
          | system guarantees (or at least lets you detect the situation)
          | that you cannot store such non-memory objects behind a traced
          | GC node (these would include plain memory objects that need
          | to be registered/unregistered precisely).
          | 
          | It might be that it would be needlessly annoying to use
          | compared to just RC. Or maybe it would be the best of both
          | worlds?
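          | 
          | Roughly what I have in mind, as a nightly-only Rust sketch
          | (GcSafe and Gc here are hypothetical, not from any existing
          | crate):
          | 
          |     #![feature(auto_traits, negative_impls)]
          |     
          |     // Implemented automatically for plain data...
          |     auto trait GcSafe {}
          |     
          |     struct FileHandle { fd: i32 }
          |     // ...but explicitly opted out of for resources.
          |     impl !GcSafe for FileHandle {}
          |     
          |     // The GC pointer only accepts GcSafe payloads.
          |     struct Gc<T: GcSafe> { ptr: *const T }
          |     
          |     fn main() {
          |         let _ok = Gc { ptr: &42u64 }; // u64 is GcSafe
          |         // let _bad = Gc { ptr: &FileHandle { fd: 3 } };
          |         // ^ error: FileHandle does not implement GcSafe
          |     }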
        
           | pjmlp wrote:
            | Not when the language also supports value types and region
            | allocators (e.g. IDisposable in .NET).
            | 
            | You can even turn it into RAII proper, by making it a
            | compilation error not to handle those interfaces properly.
           | 
            | Again with .NET, there are SafeHandles as well, alongside
            | the Marshal interop APIs.
           | 
            | This is nothing new actually; Mesa/Cedar at Xerox PARC used
            | reference counting with a cycle collector, while other
            | descendant languages down to Modula-3 and Active Oberon
            | always combined value types, tracing GC and C++-like
            | resource handling capabilities.
            | 
            | Oh, Common Lisp also has similar capabilities, especially
            | its ZetaLisp predecessor from the Lisp Machines.
            | 
            | Then Eiffel not only had this, it was also probably the
            | first Algol-like language to support non-nullable
            | references.
           | 
           | Sadly they decided to ignore all of this in Java, and then
           | its world domination kind of made everyone else ignore it as
           | well.
           | 
            | Thankfully even Java is improving its story in this regard,
            | while languages like D, Nim and, yes, .NET kind of show
            | what was already available for several decades.
        
             | _flux wrote:
              | I must be missing something. How is it possible to
              | precisely collect a resource with a tracing GC? If you
              | need to update counters when you make duplicates of
              | object references, you are not really using a tracing GC,
              | whose benefits are cheap duplication of object
              | references, cheap allocations and cheap (batched)
              | releases, and whose downside is not being able to release
              | a value precisely and automatically the moment it becomes
              | available for collection.
              | 
              | Seems to me it is impossible to have both automatic
              | precise release of resources and collection-based GC?
             | 
             | As I understand it, even the documentation for IDisposable
             | in .NET says as much at https://docs.microsoft.com/en-
             | us/dotnet/api/system.idisposab...:
             | 
             | > The primary use of this interface is to release unmanaged
             | resources. The garbage collector automatically releases the
             | memory allocated to a managed object when that object is no
             | longer used. However, it is not possible to predict when
             | garbage collection will occur. Furthermore, the garbage
             | collector has no knowledge of unmanaged resources such as
             | window handles, or open files and streams.
             | 
             | > Use the Dispose method of this interface to explicitly
             | release unmanaged resources in conjunction with the garbage
             | collector. The consumer of an object can call this method
             | when the object is no longer needed.
             | 
             | So this is the interface you can use to explicitly release
             | a resource, because the GC gets around to it only later at
             | some unspecified time.
             | 
             | About SafeHandle it says at https://docs.microsoft.com/en-
             | us/dotnet/api/system.runtime.i...:
             | 
             | > The SafeHandle class provides critical finalization of
             | handle resources, preventing handles from being reclaimed
             | prematurely by garbage collection and from being recycled
             | by Windows to reference unintended unmanaged objects.
             | 
             | Doesn't seem it's at all helpful for automatic precise
             | release of resources.
        
               | pjmlp wrote:
                | You aren't reading it properly; the documentation you
                | are reading is for the case where you leave the work to
                | the GC. You can take it into your own hands, C++ RAII
                | style:
                | 
                |     {
                |         using var my_socket = new NetworkSocket();
                |     }
                |     // my_socket is disposed when code arrives here
                | 
                | Or even better, if NetworkSocket is a struct, it gets
                | stack allocated: zero GC.
        
               | zozbot234 wrote:
               | > the benefits are the cheap duplication of object
               | references, cheap allocations and cheap (batched)
               | releases, but the downside is not being able to precisely
               | and automatically do it when the value is available for
               | collection.
               | 
               | Note that you don't need GC to reap these benefits, if
               | desired. You can allocate an arena and do secondary
               | allocations inside it, then deallocate everything in a
               | single operation. Arena deallocation is not timely or
               | precise, but it does happen deterministically.
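                | 
                | In Rust that pattern is a library away; a minimal
                | sketch with the bumpalo crate (assuming bumpalo = "3"):
                | 
                |     use bumpalo::Bump;
                |     
                |     struct Node<'a> {
                |         value: u64,
                |         next: Option<&'a Node<'a>>,
                |     }
                |     
                |     fn main() {
                |         let arena = Bump::new();
                |         // Many cheap secondary allocations...
                |         let first: &Node =
                |             arena.alloc(Node { value: 1,
                |                                next: None });
                |         let second: &Node =
                |             arena.alloc(Node { value: 2,
                |                                next: Some(first) });
                |         assert_eq!(second.next.unwrap().value, 1);
                |         // ...and one deterministic bulk deallocation
                |         // when `arena` drops here. (Caveat: bumpalo
                |         // won't run Drop impls for these values.)
                |     }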
        
               | _flux wrote:
               | True, but GC gives those benefits automatically, compared
               | to a naive program doing e.g. RC-based memory management.
               | 
               | And there is of course the question of safety; should you
               | release an arena too early, you may have introduced a
               | bug. Worse: it might not crash immediately.
               | 
               | There is actually some work for doing arena management
               | automatically, called region inference:
               | http://www.mlton.org/Regions
               | 
               | But the way I see it, it's just a way to make memory
               | management even more efficient; it's not about precise
               | release of resources, and indeed not all programs can be
               | expressed so that releases can happen only in batches of
               | an arena (assuming those arenas themselves aren't
               | dynamically managed, which certainly is a valid strategy
               | as well, but manual).
        
               | zozbot234 wrote:
               | > should you release an arena too early, you may have
               | introduced a bug.
               | 
                | A memory-safe programming language will detect any such
                | bugs and reject the program. This is not hard; it's a
                | clean application of existing lifetime checks.
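                | 
                | Rust's borrow checker already does exactly this; a
                | minimal sketch (again assuming bumpalo) that
                | deliberately fails to compile:
                | 
                |     use bumpalo::Bump;
                |     
                |     fn main() {
                |         let arena = Bump::new();
                |         let x: &u64 = arena.alloc(42);
                |         // error[E0505]: cannot move out of `arena`
                |         // because it is borrowed
                |         drop(arena);
                |         println!("{x}"); // borrow still live here
                |     }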
        
               | _flux wrote:
               | So are there some languages that do it? I'm sure the
               | devil is in the details.
        
         | astrange wrote:
         | Tracing GC has poor memory performance because it has to access
         | rarely used or swapped out pages to scan them for pointers. And
         | of course, the peak memory use is much higher since it doesn't
         | free everything as soon as possible.
         | 
         | There may be advantages if you can use it to add compaction,
         | but I don't think you need a GC to do that necessarily.
        
           | pjmlp wrote:
           | Actually it is the other way around.
           | 
           | https://github.com/ixy-languages/ixy-languages
           | 
            | No wonder that the M1 has specific architecture
            | optimizations that help streamline ARC boilerplate code,
            | while Swift 5.5 will bring more aggressive optimizations
            | (disabled by default, because applications can crash if
            | weak/unowned references are annotated improperly; see the
            | WWDC 2021 talk).
        
             | astrange wrote:
             | This isn't representative of application code and there
             | isn't even any mention of the metrics I mentioned...
             | 
             | > No wonder that M1 has specific architecture optimizations
             | that help streamline ARC boilerplate code
             | 
             | No it doesn't. I told you it didn't the last time you said
             | this.
        
               | pjmlp wrote:
               | > This isn't representative of application code and there
               | isn't even any mention of the metrics I mentioned...
               | 
                | Yeah, that is the usual answer when benchmarks show how
                | much of an urban myth reference counting performance
                | actually is.
               | 
               | > No it doesn't. I told you it didn't the last time you
               | said this.
               | 
               | Did you?
               | 
                | There is more important stuff in life to store in my
                | brain than a list of who replies to me on Hacker News.
               | 
               | Anyway,
               | 
               | https://github.com/apple/swift/blob/main/stdlib/public/Sw
               | ift...
               | 
               | https://twitter.com/ErrataRob/status/1331735383193903104
        
               | celeritascelery wrote:
               | Thanks for sharing these links! Super interesting. I do
               | have a question though. The ixy benchmarks seem to imply
               | that RC is generally slower then GC (go and C# are much
               | faster then swift and are only outdone by languages with
               | manual memory management).
               | 
               | However in the tweet thread you shared, the poster said
               | 
               | > all that reference counting overhead (already more
               | efficient than garbage collection) gets dropped in half.
               | 
                | Implying that reference counting is actually more
                | efficient. I don't know how to reconcile these two
                | observations. Do you have any insights?
        
               | pjmlp wrote:
                | The observation is made from the point of view of Swift
                | developers.
               | 
                | The only reason why Swift has reference counting is
                | historical.
               | 
                | Objective-C's GC implementation failed because it was
                | very hard to mix frameworks compiled with and without
                | GC enabled, alongside the usual issues of C memory
                | semantics.
               | 
               | https://developer.apple.com/library/archive/documentation
               | /Co...
               | 
               | Check "Inapplicable Patterns" section.
               | 
                | So Apple made the right design decision: instead of
                | trying to fix tracing GC in such an environment, just
                | like Microsoft does in COM, they looked at Cocoa's
                | retain/release pattern, automated it, and in a
                | marketing swoop sold that solution as ARC.
               | 
                | Swift, as an Objective-C replacement, naturally had to
                | build on top of ARC as a means to keep compatibility
                | with the Objective-C runtime without additional
                | overhead (check RCW/CCW for how the .NET GC deals with
                | COM).
               | 
               | Here is a paper about Swift performance,
               | 
               | http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
               | 
               | > As shown in the figure, performing RC operations takes
               | on average 42% of the execution time in client programs,
               | and 15% in server programs. The average across all
               | programs can be shown to be 32%. The Swift compiler does
               | implement optimization techniques to reduce the number of
               | RC operations similar to those described in Section
               | 2.2.2. Without them, the overhead would be higher. The RC
               | overhead is lower in server programs than in client
               | programs. This is because server programs spend
               | relatively less time in Swift code and RC operations;
               | they spend relatively more time in runtime functions for
               | networking and I/O written in C++.
               | 
                | It makes technical sense that Swift uses reference
                | counting, as explained above, but it isn't due to
                | performance; that story just sells better than
                | explaining it was due to Objective-C's inherited C
                | memory model, which, besides the memory corruption
                | problems, doesn't allow for anything better than a
                | conservative garbage collector, with very bad
                | performance.
               | 
               | https://hboehm.info/gc/
        
               | astrange wrote:
               | > Yeah, that is the usual answer when benchmarks prove
               | how much urban myth reference counting performance is
               | actually like.
               | 
               | CPU/wall time benchmarks are not that relevant to system
               | performance (seriously!) because second-order effects
               | matter more. But if you had peak memory and page demand
               | graphs that would matter.
               | 
               | For a network driver I don't know if it'd really look any
               | different though. That's mostly arena allocations.
               | 
               | >
               | https://twitter.com/ErrataRob/status/1331735383193903104
               | 
                | The fast atomics and JavaScript instructions do exist
                | but aren't "special"; they're just part of the ARM ISA.
        
       ___________________________________________________________________
       (page generated 2021-06-19 23:01 UTC)