[HN Gopher] Crafting Interpreters with Rust: On Garbage Collection
       ___________________________________________________________________
        
       Crafting Interpreters with Rust: On Garbage Collection
        
       Author : amalinovic
       Score  : 164 points
       Date   : 2024-07-30 12:57 UTC (1 days ago)
        
 (HTM) web link (www.tunglevo.com)
 (TXT) w3m dump (www.tunglevo.com)
        
       | chc4 wrote:
       | The other major issue with the Deref implementation is that
       | `&mut` needs to be an exclusive reference, and if you're doing
       | mark/sweep of GC objects via references you break that invariant
       | if you hold any `&mut` across a GC call and are causing UB. In
       | practice this probably doesn't affect your program, but I suspect
       | Miri would yell at you and there's the off chance the compiler
       | gets really tricky with inlining somewhere and you are inflicted
       | with horrible pain.
        
         | ltungv wrote:
         | Yeah true, this breaks all sorts of contracts that we have with
         | Rust xD. But if mark-and-sweep is implemented correctly, then
         | no reference is ever held across the GC. Though, there's gonna
         | be lots of pain debugging when you got it wrong, speaking from
         | experience.
         | 
         | Do you have any resources about Rust inlining and the issues it
         | might cause? I'd love to read more about that.
        
           | galangalalgol wrote:
           | With a borrowchecker what does GC let you do that is more
           | ergonomic? I have never used a borrowchecked and garbage
           | colected language so I have no experience to consult.
        
             | ltungv wrote:
             | The borrow-checker helps when you're writing Rust code. But
             | when writing an interpreter for another language, you kinda
             | have to support its semantics. In Lox, there's no move
             | semantic, no borrowing, almost everything is a heap-
             | allocated object, variables can alias, etc. Thus, you need
             | to have a way to manage the memory of the implemented
             | language without the borrow-checker. Here, the borrow-
             | checker can help with implementing the GC safely and
             | correctly, but I didn't utilize it.
        
           | chc4 wrote:
           | Mark and sweep doesn't stop you from holding references
           | across GC.
           | 
           | If you write e.g.
           | 
           | ``` let obj = some_object(); let len : &mut usize = &mut
           | obj.len; // deref_mut trigger_gc(); use(*len); ```
           | 
           | then you held a reference across the GC, and while it's
           | mark/sweeping created an aliases `&mut` to `len`.
           | 
           | Inlining was mention just because it causes function bodies
           | to be splatting together, and so puts together code that is
           | fine in isolation in a way that may allow Rust to observe UB:
           | if `trigger_gc` was inlined for example then Rust has more
           | information and code available at once, and might use the
           | knowledge to apply some optimizations that are invalid
           | because you caused UB.
           | 
           | Actually, looking at your code the much larger issue is that
           | nothing stops you from doing
           | 
           | ``` let obj = some_object(); let obj2 = obj.clone(); let len1
           | = &mut obj.len; let len2 = &mut obj2.len; let n = *len1;
           | *len2 = 0; println!("{n}"); // what does this print? ```
           | because your Deref mut are just creating unconstrained borrow
           | from pointer types that you can copy. This is almost
           | definitely going to cause you bugs in the future, since it
           | opens you up to accidentally writing iterator invalidation
           | and other issues (and even worse than C++ does, because C++
           | doesn't optimize references as heavily as Rust does borrows)
        
             | ltungv wrote:
             | Yeah, it's a self-enforced contract that objects of type
             | Gc<T> are only ever touched by the virtual machine, but I
             | get your point.
        
               | adrian17 wrote:
               | The issue is that it's a simple footgun as soon as you
               | start adding non-trivial native methods (say, string
               | methods, array.map etc). The only way to make sure that
               | they don't exist is removing the DerefMut impl entirely.
               | 
               | It's not just that it's possible to break the code, but
               | that lack of any checks makes it impossible to detect at
               | either compile-time or runtime and have it "appear to
               | work".
               | 
               | One way to solve it to remove the DerefMut implementation
               | and have it work more like Rc - as in, the user is forced
               | to write `Gc<Cell<T>>` or `Gc<RefCell<T>>` if they want
               | mutability. This solves the aliasing borrows issue at the
               | cost of extra runtime checks with RefCell (and still
               | doesn't prevent you from unsoundly calling `sweep()`
               | while holding a Gc)
        
               | ltungv wrote:
               | I implemented exactly what you were saying here (https://
               | github.com/ltungv/rox/commit/6a611e7acb3b36d0a3a4376...).
               | But where's the fun in that?
        
           | ekidd wrote:
           | I would be really careful with those deref methods. They
           | return references, not pointers, which means you need to
           | follow the Rust rules:
           | 
           | - You can have any number of simultaneous readers,
           | 
           | - _Or_ one writer and no readers.
           | 
           | - _But_ if you ever break these rules, the world may burn.
           | 
           | Using unsafe and violating these rules is one of the cases
           | where the Rust compiler can inflict worlds of misery:
           | Incorrect code generation, CPUs seeing different versions of
           | the same memory location, etc., depending on the exact
           | context. Once you use "unsafe" and break the rules, Rust can
           | be more dangerous than C. Rust even reserves the right to
           | generate code using "noalias" (except doing so often triggers
           | LLVM bugs, so it's usually turned off).
           | 
           | "Undefined behavior" means "anything at all can happen, and
           | some of it is deeply weird and awful and counterintuitive."
           | 
           | You could enforce the borrowing rules at runtime by using
           | std::cell::Cell in your heap objects, which is exactly what
           | it exists for. Or you could package everything inside a tiny
           | core module and audit it extremely carefully.
        
             | ltungv wrote:
             | I'm aware of all the issues mentioned. But for this
             | particular project, I simply don't care as long as it
             | passes Lox's test suite xD. I went this path just to see
             | how easy it is to get tripped by unsafe while knowing that
             | there's a technique to get safety with Pin<T> that this can
             | get refactored into. I actually implemented this with Cell
             | and RefCell but didn't find that interesting.
        
             | slaymaker1907 wrote:
             | You would probably want to use RefCell instead of Cell. It
             | allows you to safely upgrade into a &mut using only a
             | constant reference to RefCell, but it dynamically verifies
             | that it's actually safe using ref counting. The ref
             | counting also isn't too expensive since it isn't atomic.
        
       | jcdavis wrote:
       | Fun writeup. When I went through and implement the book in rust
       | (https://github.com/jcdavis/rulox), I just used Rc and never
       | really solved the cycle issue.
       | 
       | I'll +1 and say I highly recommend going through Crafting
       | Interpreters, particularly as a way of building non-trivial
       | programs in languages you are less familiar with - If you just
       | follow the java/C example you are tempted to lean into
       | copy/pasting the samples, but figuring things out in other
       | languages is a great experience.
       | 
       | I spent a longass time tracking down an issue with how the
       | reference interpreter implementation defines tokens:
       | https://github.com/munificent/craftinginterpreters/issues/11...
       | which was frustrating but good debugging practice.
        
         | lawn wrote:
         | > If you just follow the java/C example you are tempted to lean
         | into copy/pasting the samples, but figuring things out in other
         | languages is a great experience.
         | 
         | I'll echo this recommendation.
         | 
         | I haven't gone through Crafting Interpreters yet (it's on the
         | bookshelf) but I did go through a big chunk of Building Git and
         | instead of Ruby I did it in Rust, to get back into the language
         | and go a little deeper.
         | 
         | It was really great and it gave me a lot more than if I'd just
         | copy paste the code as you say.
        
         | ghosty141 wrote:
         | My only issue with Crafting Interpreters is that it relies too
         | much on writing code and then explaining it instead of getting
         | the concepts across first and then providing the code. Another
         | thing I disliked was some code was structured in a way that it
         | would go well with concepts that came further down the
         | implementation which can lead to confusion since it seems
         | overcomplicated/overabstracted at first.
        
           | grumpyprole wrote:
           | My minor nit pick is that it uses mostly the Java language to
           | implement a Java-like language. What have we gained? Surely
           | something Python-like in C would be more rewarding, one would
           | be gaining a level of abstraction.
        
             | munificent wrote:
             | _> Surely something Python-like in C would be more
             | rewarding, one would be gaining a level of abstraction._
             | 
             | That's the second half of the book.
        
               | grumpyprole wrote:
               | My apologies, that does indeed seem like a reasonable way
               | of doing it.
        
           | munificent wrote:
           | One of the biggest challenges with writing is linearizing:
           | deciding what order to present the material so that it makes
           | the most sense.
           | 
           | There's really no perfect solution. Some readers prefer
           | bottom up where they are given individual pieces they can
           | understand which then get composed. Others prefer top down
           | where they are given the high level problem being solved in
           | terms of intermediate steps which then get incrementally
           | defined. Some want to see code first so that they have
           | something concrete to hang concepts off of. Others want
           | concepts first and then to see the code as an illustration of
           | it.
           | 
           | This is made even harder with Crafting Interpreters because
           | the book's central conceit is that every line of code in the
           | interpreters is shown in the book. There's nothing left to
           | the reader. And, also, the book tries to give the user a
           | program they can compile and run as often as possible as they
           | work through things.
           | 
           | I did the best I could, but there are definitely some places
           | where it's really hard to figure out a good order. Sometimes
           | the book explicitly says "sorry, this won't make sense but
           | bear with me and we'll circle back".
           | 
           | Also, at the macro structure, the entire book is organized
           | into two parts where the first one is higher-level and more
           | concept driven and then the second part circles all the way
           | back to build the entire language from scratch again but at a
           | much lower level of abstraction.
           | 
           | I appreciate your feedback. It's a hard problem and a skill
           | I'm always trying to improve.
        
           | scott_s wrote:
           | I _love_ the code-first approach. It was a revelation to me
           | when I first read Mark Pilgrim 's Dive Into Python
           | (https://diveintopython3.net/).
        
         | Dowwie wrote:
         | did you not want anyone to ever find rulox on github? it has no
         | descriptive information about it in the repo
        
       | jlewallen wrote:
       | Crafting Interpreters is such an amazing work.
       | 
       | There's at least one other Rust implementation of lox that I know
       | of (https://github.com/tdp2110/crafting-interpreters-rs) (no
       | unsafe)
       | 
       | It's always interesting to see how different people approach the
       | problems in their own language or relative isolation. I agree
       | with others here, the real value of the original work lies in
       | avoiding copy and paste.
        
         | munificent wrote:
         | There are a whole pile of Lox implementations in Rust (as well
         | as many other lanugages):
         | 
         | https://github.com/munificent/craftinginterpreters/wiki/Lox-...
        
           | timsneath wrote:
           | Lox must have the highest ratio of (implementations :
           | production usage) of any language on the planet. And I mean
           | that as the highest praise -- it's proven a fantastic
           | teaching language, and your book is infectious at encouraging
           | others to jump in and follow along in various different
           | languages.
           | 
           | I've also found the exercise of implementing Lox in another
           | language as highly instructive in learning how to write
           | idiomatic code in that language. I continue to learn more
           | about the best way to express patterns as I work on my own
           | implementation. I'd recommend the journey to any professional
           | developer for this side-benefit alone.
        
             | stevekemp wrote:
             | > Lox must have the highest ratio of (implementations :
             | > production usage) of any language on the planet.
             | 
             | It's probably up there, for sure! But I'd guess that there
             | are a million toy Lisp implementations, and more people are
             | interested in writing a FORTH interpreter than actually
             | using one in production. So I'd guess if we tried to get
             | statistics it wouldn't be at the top.
             | 
             | Though there's probably a similar claim to be made for the
             | Monkey-language from Torsten Bell, via his books on
             | compilers and interpreters.
             | 
             | https://monkeylang.org/
        
             | munificent wrote:
             | _> Lox must have the highest ratio of (implementations :
             | production usage) of any language on the planet._
             | 
             | Maybe! It's definitely getting there. I suspect "semi-
             | arbitrary subset of C" still has me beat but who knows for
             | how much longer.
        
               | leftyspook wrote:
               | All praise Bob!
               | 
               | On a more serious note, have you thought about trying to
               | aim lightning at the same spot again and write another
               | book about implementing something most programmers take
               | for granted?
        
               | munificent wrote:
               | I've definitely thought about writing a third book. I
               | don't know if it would be about "something most
               | programmers take for granted". I'm more interested in
               | writing about whatever happens to excite me the most at
               | that time.
        
             | diffxx wrote:
             | I'd imagine cool is up there as well:
             | https://nguyenthanhvuh.github.io/class-compilers/cool.html
        
       | samsartor wrote:
       | I have run into a lot of similar problems writing a state
       | management framework for Rust (wip at
       | https://gitlab.com/samsartor/hornpipe) and at one point despaired
       | that I would have to abandon Rust and make a whole new reactive
       | language. But the last couple years I've got it working really
       | nicely with weak references and software transactional memory.
       | Every reference is `Copy + Send + Sync + 'static`. And you can
       | mutate objects by having a transaction make a local bitwise copy,
       | which will get atomically merged and swapped on commit. The old
       | copy gets kept around for undo/redo purposes. I've still got a
       | boatload of algorithmic puzzles to solve to provide all the MVP
       | features, which will take a while because it isn't my full-time
       | job. But the details seem technically sound.
       | 
       | I did write a blog post about some of my design thoughts,
       | although it doesn't dig into the technical guts:
       | https://samsartor.com/guis-3/
        
       | ceronman wrote:
       | Nice article. A couple of years ago I also implemented Lox in
       | Rust. And I faced the exact same issues that the author describes
       | here, and I also ended up with a very similar implementation.
       | 
       | I also wrote a post about it: https://ceronman.com/2021/07/22/my-
       | experience-crafting-an-in...
       | 
       | I ended up having two implementations: One in purely safe Rust
       | and another one with unsafe.
       | 
       | Note that if you're interesting in the "object manager" approach
       | mentioned, I did that in my safe implementation, you can take a
       | look at https://github.com/ceronman/loxido
        
         | ltungv wrote:
         | I'd read your article, and it was lovely. It nudged me to just
         | go unsafe and implement some of the data structures from
         | scratch.
        
       | Daffodils wrote:
       | Just wanted to share that the Build your own Interpreters
       | challenge on CodeCrafters is based on the Crafting Interpreters
       | book and is free at the moment.
       | 
       | Link: https://app.codecrafters.io/courses/interpreter/overview
       | 
       | (Disclaimer: I work for CodeCrafters, and built this challenge).
        
       | bombela wrote:
       | > How much memory is used for the stack and when it is freed can
       | always be determined at compile-time.
       | 
       | Technically, not always. You can allocate on the stack at
       | runtime, see "alloca" in C.
       | 
       | But is alloca even that useful in practice? I struggle to find a
       | good example.
        
         | tick_tock_tick wrote:
         | I mean it's basically the fastest allocation possible. It's
         | amazing under a very very unique situation.
         | 
         | - you need a lot of memory
         | 
         | - it needs to be as fast as possible
         | 
         | - it's a leaf function that can never call another function
         | 
         | 99.9% of the time those aren't all true.
        
           | alexchamberlain wrote:
           | Why does it have to be a leaf function?
        
             | jfoutz wrote:
             | So the massive stack grab isn't permanent.
             | 
             | Conceptually, if the compiler could just inline all of the
             | child calls, it's fine. But if this call won't return for a
             | long time, or may be called again (even worse). You're
             | setting aside stack that you'll never get back. Which is a
             | memory leak, and your program will crash.
        
               | alexchamberlain wrote:
               | That reasoning leads me to conclude the function should
               | be short lived or the memory in use by all child
               | functions, rather than it has to be a leaf function?
        
               | jfoutz wrote:
               | yeah. Leaf is a good rule of thumb, but if you need a
               | little helper to do something it's generally fine.
               | 
               | you really do need confidence the little helper can be
               | fully inlined. if the helper is out of your control, and
               | someone greatly expands its responsibility, it can suck.
               | 
               | So, yeah, effectively a leaf.
        
               | bombela wrote:
               | The stack grab is as permanent as any other stack
               | variable. There is nothing special about it. The only
               | difference is that you can decide at runtime how much to
               | add to the current stack frame.
        
               | mypalmike wrote:
               | If the memory's lifetime correctly coincides with the
               | function (which is why you might use alloca in the first
               | place), I don't see how this would be a memory leak nor
               | why it would lead to a crash. Maybe on systems which
               | limit stack size...?
        
           | slaymaker1907 wrote:
           | There's another reason: we need to do string stuff and we
           | absolutely can't stall. Malloc always has some chance of
           | stalling due to heap fragmentation. The leaf function
           | requirement is also not necessary. You just have to make sure
           | you don't alloca too much memory given how much is remaining
           | on the stack and roughly how much memory you know subsequent
           | calls will take.
           | 
           | It's not all that much worse than using a huge, static stack
           | allocation. In fact, there are cases where alloca is safer
           | since you can take into account how much stack memory you
           | already have available before incrementing the stack pointer.
        
         | ltungv wrote:
         | Oh right. I read about this once or twice and have never looked
         | at it again. I've made the (wrong) assumption that this
         | function needs some compile-time help. Is it all done at
         | runtime? Or does that depend on the actual C implementation?
        
           | bombela wrote:
           | Stack allocation is always done at runtime. Baring special
           | optimizations, the CPU and the code work together to (stack
           | grows downward) decrement the CPU stack pointer register.
           | This register keeps track of where the bottom of the stack
           | is.
           | 
           | What is done at compile time by the compiler, is calculating
           | the amount to decrement the CPU stack pointer register at
           | runtime for every stack variables.
           | 
           | The compiler and alloca must work together, because
           | optimization can easily break something here. For example,
           | when a function return, the stack pointer register better be
           | restored properly. Furthermore, in Rust you must not forget
           | to call Drop::drop() for all types that have such a
           | destructor.
        
         | ozgrakkurt wrote:
         | This was used in linux kernel and they were removing it later
         | because it was a huge pain as far as I can remember. Seems like
         | it causes problems because entire system is built assuming
         | stack/heap work in a certain way.
        
       | kaspermarstal wrote:
       | > Don't ask me about covariance since I honestly don't know the
       | answer. I recommend reading the section on subtyping and variance
       | and The Rustonomicon. I went through them a few times but still
       | don't entirely get all the details. Then, we can define each type
       | of object like so:
       | 
       | Haha this is gold. We ALL do this ALL the time.
        
       | lowbloodsugar wrote:
       | TL;DR: Don't use unsafe to break the rules. Use unsafe to enforce
       | the rules in new ways.
       | 
       | Great journey. That's the thing about "unsafe" and "questionable
       | parts"... if they really are questionable, then don't do 'em. In
       | this case, if it is the case that holding a reference to a GC
       | object guarantees that that object can't be freed, then it's not
       | questionable. Proving that to be case can be fun tho.
       | 
       | The question is: does my unsafe code allow violating the borrow
       | rules?
       | 
       | Cell, RefCell, RwLock etc give you interior mutability, but there
       | is no way to violate the borrow rules with them. On the surface,
       | it looks like you are violating them because "I have a non-mut
       | thing that I can mutate".
       | 
       | If you build a thing that allows two distinct &mut T's to the
       | same thing, then you are basically asking for a bug. Don't use
       | unsafe to break the rules. Use unsafe to enforce the rules in new
       | ways.
        
         | ltungv wrote:
         | If someone else depends on this project, I'd definitely not
         | implement the questionable stuff. You're right that proving
         | whether it's safe can be lots of fun, and I'm planning to try
         | it out.
         | 
         | Based on what I've read, it's entirely possible. If my Rust
         | knowledge is correct, there are 2 things that have to be proven
         | - the object's lifetime and the borrow rules.
         | 
         | Proving the lifetime can be done at compile-time based on what
         | was discussed in these articles
         | https://without.boats/tags/shifgrethor/. But that still leaves
         | us with proving that we don't violate the borrow rules. AFAIK,
         | the only way to do it in my implementation is with Cell and
         | RefCell enforcing the borrow rules at runtime.
         | 
         | Before removing all the safety checks, I actually used
         | Gc<RefCell<T>> for mutable objects, so the borrow rules were
         | enforced but not the lifetime. However, I decided to remove
         | RefCell<T> to have some fun and see how I could shoot myself in
         | the foot.
        
           | adrian17 wrote:
           | Just for reference, the gc-arena crate
           | (https://github.com/kyren/gc-arena) we use in Ruffle's
           | Actionscript 1/2/3 interpreters and across the entire engine,
           | does this. As far as we know, it's safe and sound, at the
           | cost of some limitations, mainly: to compile-time prevent
           | sweeping while holding onto Gc references, every struct and
           | function touching Gc objects passes an extra 'gc lifetime.
           | And not being able to run the GC at arbitrary time can be
           | major issue depending on what you're doing; in Ruffle we
           | incrementally collect between frames and pray that no single
           | frame will ever allocate enough to OOM in the middle of it :)
           | 
           | And yeah, RefCells everywhere.
        
       | scott_s wrote:
       | In case the author reads this: please explicitly cite all of
       | Nystrom's figures. A link is not enough.
       | 
       | Even with a citation, I'm not quite comfortable just reusing
       | someone else's figures so many times when they're doing so much
       | heavy lifting. But an explicit citation is the minimum.
        
         | imachine1980_ wrote:
         | idk if he/she change it, but i see the name big and center
         | after each use.
        
         | ltungv wrote:
         | Thanks for the suggestion. I updated it with more explicit
         | citations. After seeing your comment, I just looked around and
         | saw that no license was given for the images.
         | 
         | I should probably draw my own.
        
       | nestorD wrote:
       | Note that there is quite a bit of thinking and available
       | implementations of GC implementations available in Rust:
       | https://manishearth.github.io/blog/2021/04/05/a-tour-of-safe...
        
       | ajeetdsouza wrote:
       | This is really well-written!
       | 
       | Shameless plug: you may want to check out Loxcraft:
       | https://github.com/ajeetdsouza/loxcraft
       | 
       | I too followed the path of "ignore safe Rust for maximum
       | performance". It got pretty close to the C version, even beating
       | it on some benchmarks.
        
       ___________________________________________________________________
       (page generated 2024-07-31 23:00 UTC)