[HN Gopher] A Tour of Safe Tracing GC Designs in Rust
___________________________________________________________________
A Tour of Safe Tracing GC Designs in Rust
Author : belter
Score : 95 points
Date : 2021-06-19 10:01 UTC (13 hours ago)
(HTM) web link (manishearth.github.io)
(TXT) w3m dump (manishearth.github.io)
| kgeist wrote:
| >In Go, the compiler needs to automatically insert code that
| checks the "pulse" of the heap every now and then, and
| potentially runs garbage collection. It also needs to
| automatically insert code that can tell the scheduler "hey now is
| a safe time to interrupt me if a different goroutine wishes to
| run"
|
| Is it still the case after Go 1.14 (February 2020) when
| asynchronous preemption was implemented? As far as I know,
| goroutines are now preempted using signals instead of inserting
| checks at every function call.
| uluyol wrote:
| The signal-based mechanism is only used for stragglers.
|
| Go still uses the function call check. The runtime devs found a
| way to combine it with a check to see if the goroutine's stack
| needs to be resized, so it doesn't actually cost anything.
| pjmlp wrote:
| Having something like this available as vocabulary types would be
| quite nice for GUI libraries.
| mjw1007 wrote:
| Some history of the question of whether Rust should have GC as
| standard:
|
| 2014-09. RFC 256 https://rust-lang.github.io/rfcs/0256-remove-
| refcounting-gc-...
|
| << Remove the reference-counting based Gc<T> type from the
| standard library and its associated support infrastructure from
| rustc.
|
| Doing so lays a cleaner foundation upon which to prototype a
| proper tracing GC >>
|
| Contemporary with that:
| https://news.ycombinator.com/item?id=8312327 nikomatsakis:
|
| << I wouldn't be so quick to give up on GC support yet! "We have
| a plan!" But I don't think we'll have it in for Rust 1.0. >>
|
| 2015-04:
|
| https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.h...
|
| "Memory safety without garbage collection." is now advertised as
| a "key value proposition" (this isn't quite the same as saying
| "we're not thinking of ever adding garbage collection", of
| course).
|
| 2016-08: https://manishearth.github.io/blog/2016/08/18/gc-
| support-in-...
|
| << Recently we (Felix, Niko, and I) have been working on getting
| compiler-level GC support for Rust. The plan is to provide a base
| set of APIs and intrinsics on which GCs can be built, without
| including an actual GC itself. >>
|
| This is the "Unfinished design for a builtin Rust GC" described
| in TFA.
|
| 2018-10:
|
| the introductory post for the shifgrethor library (described in
| TFA): https://boats.gitlab.io/blog/post/shifgrethor-i/
|
| includes a bolded << I do not expect we will ever "add GC to
| Rust" >>.
| dgb23 wrote:
| It's amazing how openly and organically this evolved.
| amelius wrote:
| Wouldn't it make more sense to have garbage collectors at the
| level of LLVM?
|
| That way, you could have direct access to all the pointers in a
| program without jumping through the language's hoops. And other
| languages could benefit from it too.
| joppy wrote:
| Yes, it makes total sense for the garbage collector to be
| tightly coupled to the memory allocation system, which is also
| tightly coupled to code generation (as is done in the case of
| various high-performance managed languages: Java, C#, Haskell,
| ...). I remember hearing a talk a few years ago about a system
| of the kind you're talking about - a kind of "managed language"
| micro-core that ran alongside LLVM so that it could hook into
| code generation, and was intended to be a kind of toolkit for
| making new managed languages. Can't remember the name...
| an_opabinia wrote:
| The performance is still too bad for anything but Java on
| real applications. The other killer feature, ahead-of-time
| compilation, is really clunky, and its most innovative
| change, static initialization at build time by default, was
| reverted to runtime initialization.
| duped wrote:
| That sounds a lot like Graal
| saurik wrote:
| The concept of what a pointer "is" is somewhat language-
| dependent, particularly as people finally attempt to mitigate
| the waste of moving to 64-bit by using tagged pointers (which
| maybe could try to be modelled in LLVM also, but I think you
| will rapidly realize that LLVM just isn't modeling a lot of
| things that affect scanning).
| tialaramex wrote:
| > particularly as people finally attempt to mitigate the
| waste of moving to 64-bit by using tagged pointers
|
| It's very fraught to try to "mitigate the waste" at the top
| of the pointer. Everybody involved is clear that what you're
| doing there is saving up pain for yourself because
| periodically more of those "wasted" bits become significant,
| and each time that happens, if you've been using them, you've
| now incurred a maintenance burden.
|
| The waste at the _bottom_ of the pointer is a far more
| sensible place to hide information, because it won't get
| taken away when somebody ships a new CPU; it was already
| there in 32-bit architectures.
|
| LLVM itself makes use of the latter trick extensively as I
| understand it, but doesn't use the former. If they've got a
| 64-byte structure, they can align them, then steal six bits
| at the bottom of each pointer to those structures for bit
| flags, and when Intel ships a new CPU their code still works.
| Whereas if I steal ten bits at the top of the pointer, when
| Intel ships Ice Lake CPUs my compiler now crashes. Good luck
| debugging that!
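|
| A minimal Rust sketch of that low-bit trick, assuming
| 64-byte-aligned allocations (the Node type and the tag layout
| here are illustrative, not LLVM's actual code):
|
|     // 64-byte alignment guarantees the low 6 bits of any
|     // pointer to a Node are zero, so they can carry flags.
|     #[repr(align(64))]
|     struct Node {
|         data: u64,
|     }
|
|     const TAG_MASK: usize = 0b11_1111; // 6 spare low bits
|
|     fn tag(ptr: *const Node, bits: usize) -> usize {
|         debug_assert!(bits <= TAG_MASK);
|         debug_assert_eq!(ptr as usize & TAG_MASK, 0);
|         ptr as usize | bits
|     }
|
|     fn untag(word: usize) -> (*const Node, usize) {
|         ((word & !TAG_MASK) as *const Node, word & TAG_MASK)
|     }
|
| New CPUs with wider virtual addresses never disturb these low
| bits, which is the forward-compatibility argument above.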
| zozbot234 wrote:
| > Whereas if I steal ten bits at the top of the pointer,
| when Intel ships Ice Lake CPUs my compiler now crashes.
| Good luck debugging that!
|
| That's not how it works. The x86-64 ISA requires these
| "top" bits to be properly sign extended when actually
| accessing addresses, so any binary code that's using this
| "trick" is forward compatible wrt. future CPU's that might
| enable a bigger virtual address space. The only concern
| wrt. compatibility is purely ABI related but that can be
| managed at the OS and system-software layer.
| tialaramex wrote:
| > That's not how it works. The x86-64 ISA requires these
| "top" bits to be properly sign extended when actually
| accessing addresses.
|
| On Coffee Lake no more than 48 bits of the virtual
| address are "real". If I steal ten of the 64 bits from
| the top, all the actual address information is preserved
| and I've lost nothing. My compiler works just fine.
|
| On Ice Lake up to 57 bits of the virtual address are
| "real". Since I'm stealing ten of the 64 bits from the
| top, I only have 54 bits left, so 3 significant bits were
| dropped and I don't have them any more. Sign-extending
| the wrong value won't magically make it into the right
| value.
| klodolph wrote:
| Yes--and my concern is that old binaries will suddenly
| stop working on newer hardware. This has happened before!
|
| The whole reason the top bits are sign-extended in the
| first place is to discourage these kinds of shenanigans.
| I remember the pain of switching 24-bit to 32-bit on M68K
| Macs.
| celeritascelery wrote:
| Until the virtual address space gets so big that there
| are no bits left at the top that are "free". As soon as
| you have a 64 bit virtual address space, pointer tagging
| the top bits no longer works. Your program is now broken
| on the new CPU.
| devit wrote:
| Of course competent programmers using the top K bits of
| the pointer would make sure that the kernel and memory
| allocator operate so that they don't allocate memory
| outside a (64 - K)-bit address space, regardless of what
| the CPU supports (e.g. Linux mmap as one would expect
| won't allocate beyond 2^48 unless explicitly requested).
|
| And obviously any intelligent CPU architect, if they make
| a CPU that masks the upper bits, they would make the
| number of bits that gets masked configurable per-process
| and not dependent on the maximum virtual address space
| supported by the CPU.
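|
| For contrast, a sketch of the fragile top-bits variant under
| the assumption above (a 64-bit target whose allocator never
| returns addresses at or above 2^48; the names are
| illustrative):
|
|     const ADDR_BITS: u32 = 48;
|     const ADDR_MASK: usize = (1usize << ADDR_BITS) - 1;
|
|     fn tag_high(ptr: *const u8, tag: u16) -> usize {
|         // Assumption: the kernel/allocator kept us below 2^48.
|         debug_assert_eq!(ptr as usize & !ADDR_MASK, 0);
|         ((tag as usize) << ADDR_BITS) | ptr as usize
|     }
|
|     fn untag_high(word: usize) -> (*const u8, u16) {
|         ((word & ADDR_MASK) as *const u8,
|          (word >> ADDR_BITS) as u16)
|     }
|
| The debug_assert encodes exactly the invariant that breaks if
| the address space ever grows past 48 bits, as discussed above.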
| celeritascelery wrote:
| > Of course competent programmers ... And obviously any
| intelligent CPU architect
|
| Interesting to see the future is so "obvious" to you. You
| are correct, however, that this is not a problem at
| present, because no architecture uses more than 48 bits of
| address space in practice. And since this is such a
| common practice now, allowances will have to be made
| in the future for when the address space is expanded.
| vlovich123 wrote:
| As a sibling to you noted, Ice Lake has bumped this from
| 48 bits to 57 bits to support 5-level paging [1]. So the
| "in practice" appears to already by out of date. This was
| known publicly as early as 2017 (albeit this is the first
| I've heard of it).
|
| That being said, I don't know why the increase from 256TB
| to 128PB is at all useful but it's enabled in my Arch
| Linux kernel on a 32GB laptop, so it's clearly been adopted
| by the broader ecosystem even outside whatever niche
| use-case it's for.
|
| That being said, the kernel maintains backward
| compatibility by restricting the allocation address space
| it hands out by default. Userspace needs to explicitly
| ask for the higher bits [2], so it's probably OK to just
| continue as-is assuming the 48-bit limit in practice.
|
| [1] https://en.wikipedia.org/wiki/Intel_5-level_paging
|
| [2]
| https://www.kernel.org/doc/html/latest/x86/x86_64/5level-
| pag....
| zozbot234 wrote:
| > And obviously any intelligent CPU architect, if they
| make a CPU that masks the upper bits, they would make the
| number of bits that gets masked configurable per-process
|
| x86-64 does this by requiring all accesses to be
| correctly sign-extended, which means that valid virtual
| addresses are unique and there is full forward
| compatibility even with an extended virtual address
| space. The "number of bits that get masked" is a pure
| system/ABI concern, the CPU support only defines a
| maximum.
| saurik wrote:
| The longer I stare at your comment, the more I feel you are
| hung up on a semantic issue. While the
| intermediate states I went through were maybe a bit more
| interestingly explanatory with respect to specific
| techniques, the final state I landed on is simply: "if you
| only used the bottom single bit to tag that the entire
| value is a value instead of a pointer, that alone mitigates
| the waste of moving to 64 bits by having fewer indirect
| allocations".
| tialaramex wrote:
| The 32-bit platform can do this same trick, so, your
| 64-bit pointers are still wider than theirs, and you
| didn't gain anything _unless_ you also steal high bits.
|
| This isn't quite entirely true: for them, address space is
| very tight and so they daren't just align everything and
| risk wasting too much address space, whereas on a 64-bit
| platform you can afford to be very generous with address
| space. This is akin to how it's fine to have randomised
| privacy addressing in IPv6, but that'd be a problem in
| IPv4. But this gain is hard to give a specific value to,
| whereas the simple fact is that 64-bit pointers are
| exactly twice as big as 32-bit pointers.
|
| This can give you the incentive to build a different data
| structure; it's surprising how often "just use a vector"
| is correct these days, as Chandler Carruth has discussed
| at length (in C++). If you have at least some idea
| how many Foozles you need, and Foozles aren't some crazy
| multi-page data structure, chances are you should put all
| the Foozles in a vector, and not carry around actual
| pointers to Foozles at all.
| xfer wrote:
| Isn't using the bottom bit how most runtimes represent
| their ints, taking advantage of the fact that GC pointers
| are aligned? This is common practice afaiu.
| tialaramex wrote:
| Sure, but for the bottom bit, 64-bit addressing doesn't
| come into it. If you only want to steal bits at the
| bottom and have pointers with greater than single byte
| alignment this worked just fine on a 32-bit CPU already.
|
| I was reacting to the claim that this mitigates the
| overhead (larger pointer size) of 64-bit addressing. You
| _can_ steal bits at the top but it's probably a bad
| idea.
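|
| The encoding in question, sketched in Rust; it behaves the
| same on 32-bit and 64-bit targets, which is the point being
| made:
|
|     enum Decoded {
|         Int(isize),     // immediate small integer
|         Ptr(*const u8), // aligned (even) heap pointer
|     }
|
|     // Loses one bit of range, like OCaml's 63-bit ints.
|     fn encode_int(i: isize) -> usize {
|         ((i << 1) | 1) as usize // low bit 1 marks an integer
|     }
|
|     fn decode(word: usize) -> Decoded {
|         if word & 1 == 1 {
|             Decoded::Int((word as isize) >> 1) // restore sign
|         } else {
|             Decoded::Ptr(word as *const u8)
|         }
|     }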
| boredprograming wrote:
| LLVM has built in hooks for GC, used by Azul's Zing JVM JIT
| compiler and perhaps Safari's LLVM JS JIT.
|
| I'm not sure if the GC integration is tied to JIT support or
| not. I think it's mostly related to the insertion of
| safepoints, which could be useful for a Rust implementation.
| celeritascelery wrote:
| For me, the most interesting one is Josephine[1]. This allows
| tracking of stack roots, doesn't limit the time when GC can run,
| and takes advantage of Rust's affine type system. However, I have
| to say the API feels fairly clunky, with the need to root
| everything before mutation and the need to pass context objects
| to every method.
|
| [1] https://github.com/asajeffrey/josephine
| zozbot234 wrote:
| The OP does not seem to mention CactusRef
| https://crates.io/crates/cactusref which provides deterministic
| deallocation (on par with simple refcounting) of reference
| cycles, with minimal tracing requirements and minimal memory
| overhead.
| pcwalton wrote:
| It requires unsafe.
| zozbot234 wrote:
| It _used to_ require unsafe. The project has been reworked to
| add safer alternatives (at the cost of some increased
| overhead).
| sizediterable wrote:
| Probably because cactusref was only published 5 days ago and
| this article is from three months ago...
| boredprograming wrote:
| One day, Rust needs a GC. Reference counting is just a crappy
| GC. A modern GC can perform better than this, so Rust is
| actually hurting its own performance by not having one.
|
| A good GC would make heavily concurrent apps much easier to
| build with Rust, and would have better performance than the
| typical Arc<Mutex<T>> objects passed around right now.
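|
| For reference, a std-only sketch of the pattern being
| criticized: shared mutable state across threads in today's
| Rust typically looks like this:
|
|     use std::sync::{Arc, Mutex};
|     use std::thread;
|
|     fn main() {
|         let counter = Arc::new(Mutex::new(0u64));
|         let handles: Vec<_> = (0..4)
|             .map(|_| {
|                 let counter = Arc::clone(&counter); // refcount bump
|                 thread::spawn(move || {
|                     *counter.lock().unwrap() += 1; // lock per access
|                 })
|             })
|             .collect();
|         for h in handles {
|             h.join().unwrap();
|         }
|         assert_eq!(*counter.lock().unwrap(), 4);
|     }
|
| Every clone touches an atomic refcount and every access takes
| a lock; the claim above is that a good concurrent tracing GC
| would make such sharing both simpler and cheaper.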
| mhh__ wrote:
| I would need some data on that, _but_ I have to say that it
| always makes me laugh when people _only_ talk about the GC in
| threads about D. It's so good for productivity. I don't really
| like it, but I can't describe just how much of a non-issue it is
| for us (the company I work for).
| _flux wrote:
| Tracing GC is troublesome for any non-memory resource, such as
| a network connection or file handle, due to its untimely
| release, but otherwise I actually agree: reference counting is
| a GC mechanism--not a very good one, but it's the only one I'm
| aware of that works both for memory and resources.
|
| I would enjoy seeing someone test a model where the type system
| guarantees (or at least lets you detect the situation) that you
| cannot store such non-memory objects behind a traced GC node
| (these would include plain memory objects that need to be
| registered/unregistered precisely).
|
| It might be that it would be needlessly annoying to use
| compared to plain RC. Or maybe it would be the best of both
| worlds?
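|
| A hypothetical encoding of that idea in today's Rust; the
| GcSafe trait and Gc type are invented here for illustration:
|
|     use std::marker::PhantomData;
|
|     // Marker: "plain memory, no precise cleanup required".
|     trait GcSafe {}
|
|     // A traced GC pointer may only hold GcSafe payloads.
|     struct Gc<T: GcSafe>(PhantomData<T>);
|
|     struct Point { x: f64, y: f64 }
|     impl GcSafe for Point {} // fine: just bytes
|
|     // std::fs::File gets no GcSafe impl, so storing a file
|     // handle behind the traced pointer fails to compile:
|     // let f: Gc<std::fs::File> = ...; // bound not satisfied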
| pjmlp wrote:
| Not when the language also supports value types and region
| allocators (e.g. IDisposable in .NET).
|
| You can even turn it into proper RAII, by turning failure to
| handle those interfaces properly into a compilation error.
|
| Again with .NET, there are SafeHandles as well, alongside the
| Marshal interop APIs.
|
| This is nothing new, actually: Mesa/Cedar at Xerox PARC used
| reference counting with a cycle collector, while other
| descendant languages down to Modula-3 and Active Oberon
| always combined value types, tracing GC, and C++-like
| resource handling capabilities.
|
| Oh, Common Lisp also has similar capabilities, especially
| its ZetaLisp predecessor from the Lisp Machines.
|
| Then Eiffel not only had this, it was also probably the first
| Algol-like language to support non-nullable references.
|
| Sadly they decided to ignore all of this in Java, and then
| its world domination kind of made everyone else ignore it as
| well.
|
| Thankfully even Java is improving their story in this regard,
| while languages like D, Nim and yes .NET kind of show what
| was already available for several decades.
| _flux wrote:
| I must be missing something. How is it possible to
| precisely collect a resource with a tracing GC? And if you
| need to update counters when you make duplicates of object
| references, you are not using a tracing GC, whose
| benefits are the cheap duplication of object references,
| cheap allocations, and cheap (batched) releases, and whose
| downside is not being able to precisely and automatically
| release a value the moment it becomes available for
| collection.
|
| Seems to me it is impossible to have both automatic,
| precise release of resources and collection-based GC?
|
| As I understand it, even the documentation for IDisposable
| in .NET says as much at https://docs.microsoft.com/en-
| us/dotnet/api/system.idisposab...:
|
| > The primary use of this interface is to release unmanaged
| resources. The garbage collector automatically releases the
| memory allocated to a managed object when that object is no
| longer used. However, it is not possible to predict when
| garbage collection will occur. Furthermore, the garbage
| collector has no knowledge of unmanaged resources such as
| window handles, or open files and streams.
|
| > Use the Dispose method of this interface to explicitly
| release unmanaged resources in conjunction with the garbage
| collector. The consumer of an object can call this method
| when the object is no longer needed.
|
| So this is the interface you can use to explicitly release
| a resource, because the GC gets around to it only later at
| some unspecified time.
|
| About SafeHandle it says at https://docs.microsoft.com/en-
| us/dotnet/api/system.runtime.i...:
|
| > The SafeHandle class provides critical finalization of
| handle resources, preventing handles from being reclaimed
| prematurely by garbage collection and from being recycled
| by Windows to reference unintended unmanaged objects.
|
| Doesn't seem it's at all helpful for automatic precise
| release of resources.
| pjmlp wrote:
| You aren't reading it properly; the documentation you are
| reading is for the case where you leave the work to the GC.
| You can handle it yourself, C++ RAII style:
|
|     {
|         using var my_socket = new NetworkSocket();
|         // ...
|     }
|     // my_socket no longer exists when code arrives here
|
| Or even better, if NetworkSocket is a struct, it gets
| stack allocated: zero GC.
| zozbot234 wrote:
| > the benefits are the cheap duplication of object
| references, cheap allocations and cheap (batched)
| releases, but the downside is not being able to precisely
| and automatically do it when the value is available for
| collection.
|
| Note that you don't need GC to reap these benefits, if
| desired. You can allocate an arena and do secondary
| allocations inside it, then deallocate everything in a
| single operation. Arena deallocation is not timely or
| precise, but it does happen deterministically.
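|
| A minimal sketch of that pattern using the typed-arena crate
| (one possible library; bumpalo is similar):
|
|     use typed_arena::Arena;
|
|     struct Node<'a> {
|         value: u32,
|         next: Option<&'a Node<'a>>,
|     }
|
|     fn main() {
|         let arena = Arena::new();
|         let a: &Node = arena.alloc(Node { value: 1, next: None });
|         let b = arena.alloc(Node { value: 2, next: Some(a) });
|         assert_eq!(b.next.unwrap().value, 1);
|     } // arena dropped: every Node freed in one deallocation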
| _flux wrote:
| True, but GC gives those benefits automatically, compared
| to a naive program doing e.g. RC-based memory management.
|
| And there is of course the question of safety; should you
| release an arena too early, you may have introduced a
| bug. Worse: it might not crash immediately.
|
| There is actually some work on doing arena management
| automatically, called region inference:
| http://www.mlton.org/Regions
|
| But the way I see it, it's just a way to make memory
| management even more efficient; it's not about precise
| release of resources, and indeed not all programs can be
| expressed so that releases can happen only in batches of
| an arena (assuming those arenas themselves aren't
| dynamically managed, which certainly is a valid strategy
| as well, but manual).
| zozbot234 wrote:
| > should you release an arena too early, you may have
| introduced a bug.
|
| A memory-safe programming language will detect any such
| bugs and reject the program. This is not hard; it's a
| clean application of existing lifetime checks.
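|
| In Rust terms the borrow checker already expresses this: a
| reference into an arena cannot escape the arena's scope. A
| deliberately rejected sketch (typed-arena again):
|
|     use typed_arena::Arena;
|
|     fn main() {
|         let escaped;
|         {
|             let arena = Arena::new();
|             escaped = arena.alloc(42u32);
|         } // error[E0597]: `arena` does not live long enough
|         println!("{}", escaped); // caught at compile time
|     }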
| _flux wrote:
| So are there some languages that do it? I'm sure the
| devil is in the details.
| astrange wrote:
| Tracing GC has poor memory performance because it has to access
| rarely used or swapped out pages to scan them for pointers. And
| of course, the peak memory use is much higher since it doesn't
| free everything as soon as possible.
|
| There may be advantages if you can use it to add compaction,
| but I don't think you need a GC to do that necessarily.
| pjmlp wrote:
| Actually it is the other way around.
|
| https://github.com/ixy-languages/ixy-languages
|
| No wonder the M1 has specific architecture optimizations
| that help streamline ARC boilerplate code, while Swift 5.5
| will bring more aggressive optimizations (disabled by
| default, because applications can crash if weak/unowned
| references are annotated improperly => WWDC 2021 talk).
| astrange wrote:
| This isn't representative of application code and there
| isn't even any mention of the metrics I mentioned...
|
| > No wonder that M1 has specific architecture optimizations
| that help streamline ARC boilerplate code
|
| No it doesn't. I told you it didn't the last time you said
| this.
| pjmlp wrote:
| > This isn't representative of application code and there
| isn't even any mention of the metrics I mentioned...
|
| Yeah, that is the usual answer when benchmarks show how
| much of an urban myth reference counting performance
| actually is.
|
| > No it doesn't. I told you it didn't the last time you
| said this.
|
| Did you?
|
| There is more important stuff in life to store in my
| brain than a list of who replies to me on Hacker News.
|
| Anyway,
|
| https://github.com/apple/swift/blob/main/stdlib/public/Sw
| ift...
|
| https://twitter.com/ErrataRob/status/1331735383193903104
| celeritascelery wrote:
| Thanks for sharing these links! Super interesting. I do
| have a question though. The ixy benchmarks seem to imply
| that RC is generally slower than GC (Go and C# are much
| faster than Swift and are only outdone by languages with
| manual memory management).
|
| However, in the tweet thread you shared, the poster said
|
| > all that reference counting overhead (already more
| efficient than garbage collection) gets dropped in half.
|
| Implying that reference counting is actually more
| efficient. I don't know how to reconcile these two
| observations. Do you have any insights?
| pjmlp wrote:
| The observation is made from the point of view of Swift
| developers.
|
| The only reason Swift has reference counting is
| historical.
|
| Objective-C's GC implementation failed because it was very
| hard to mix frameworks compiled with and without GC
| enabled, alongside the usual issues of C memory
| semantics.
|
| https://developer.apple.com/library/archive/documentation
| /Co...
|
| Check "Inapplicable Patterns" section.
|
| So Apple made the right design decision: instead of trying
| to fix tracing GC in such an environment, just like
| Microsoft does in COM, they looked at Cocoa's
| [retain]/[release] pattern, automated it, and in a
| marketing swoop sold that solution as ARC.
|
| Swift, as an Objective-C replacement, naturally had to build
| on top of ARC as a means to keep compatibility with the
| Objective-C runtime without additional overhead (check
| RCW/CCW for how the .NET GC deals with COM).
|
| Here is a paper about Swift performance,
|
| http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
|
| > As shown in the figure, performing RC operations takes
| on average 42% of the execution time in client programs,
| and 15% in server programs. The average across all
| programs can be shown to be 32%. The Swift compiler does
| implement optimization techniques to reduce the number of
| RC operations similar to those described in Section
| 2.2.2. Without them, the overhead would be higher. The RC
| overhead is lower in server programs than in client
| programs. This is because server programs spend
| relatively less time in Swift code and RC operations;
| they spend relatively more time in runtime functions for
| networking and I/O written in C++.
|
| It makes technical sense that Swift uses reference
| counting, as explained above, but it isn't due to
| performance; that story just sells better than explaining
| it was due to the C memory model inherited from
| Objective-C, which, besides the memory corruption
| problems, doesn't allow for anything better than a
| conservative garbage collector, with very bad performance.
|
| https://hboehm.info/gc/
| astrange wrote:
| > Yeah, that is the usual answer when benchmarks prove
| how much urban myth reference counting performance is
| actually like.
|
| CPU/wall time benchmarks are not that relevant to system
| performance (seriously!) because second-order effects
| matter more. But if you had peak memory and page demand
| graphs that would matter.
|
| For a network driver I don't know if it'd really look any
| different though. That's mostly arena allocations.
|
| >
| https://twitter.com/ErrataRob/status/1331735383193903104
|
| The fast atomics and JavaScript instructions do exist but
| aren't "special", they're just part of the ARM ISA.
___________________________________________________________________
(page generated 2021-06-19 23:01 UTC)