[HN Gopher] Crafting Interpreters with Rust: On Garbage Collection
___________________________________________________________________
Crafting Interpreters with Rust: On Garbage Collection
Author : amalinovic
Score : 164 points
Date : 2024-07-30 12:57 UTC (1 days ago)
(HTM) web link (www.tunglevo.com)
(TXT) w3m dump (www.tunglevo.com)
| chc4 wrote:
| The other major issue with the Deref implementation is that
| `&mut` needs to be an exclusive reference, and if you're doing
| mark/sweep of GC objects via references you break that invariant
| if you hold any `&mut` across a GC call and are causing UB. In
| practice this probably doesn't affect your program, but I suspect
| Miri would yell at you and there's the off chance the compiler
| gets really tricky with inlining somewhere and you are inflicted
| with horrible pain.
| ltungv wrote:
| Yeah true, this breaks all sorts of contracts that we have with
| Rust xD. But if mark-and-sweep is implemented correctly, then
| no reference is ever held across the GC. Though, there's gonna
| be lots of pain debugging when you got it wrong, speaking from
| experience.
|
| Do you have any resources about Rust inlining and the issues it
| might cause? I'd love to read more about that.
| galangalalgol wrote:
| With a borrowchecker what does GC let you do that is more
| ergonomic? I have never used a borrowchecked and garbage
| colected language so I have no experience to consult.
| ltungv wrote:
| The borrow-checker helps when you're writing Rust code. But
| when writing an interpreter for another language, you kinda
| have to support its semantics. In Lox, there's no move
| semantic, no borrowing, almost everything is a heap-
| allocated object, variables can alias, etc. Thus, you need
| to have a way to manage the memory of the implemented
| language without the borrow-checker. Here, the borrow-
| checker can help with implementing the GC safely and
| correctly, but I didn't utilize it.
| chc4 wrote:
| Mark and sweep doesn't stop you from holding references
| across GC.
|
| If you write e.g.
|
| ``` let obj = some_object(); let len : &mut usize = &mut
| obj.len; // deref_mut trigger_gc(); use(*len); ```
|
| then you held a reference across the GC, and while it's
| mark/sweeping created an aliases `&mut` to `len`.
|
| Inlining was mention just because it causes function bodies
| to be splatting together, and so puts together code that is
| fine in isolation in a way that may allow Rust to observe UB:
| if `trigger_gc` was inlined for example then Rust has more
| information and code available at once, and might use the
| knowledge to apply some optimizations that are invalid
| because you caused UB.
|
| Actually, looking at your code the much larger issue is that
| nothing stops you from doing
|
| ``` let obj = some_object(); let obj2 = obj.clone(); let len1
| = &mut obj.len; let len2 = &mut obj2.len; let n = *len1;
| *len2 = 0; println!("{n}"); // what does this print? ```
| because your Deref mut are just creating unconstrained borrow
| from pointer types that you can copy. This is almost
| definitely going to cause you bugs in the future, since it
| opens you up to accidentally writing iterator invalidation
| and other issues (and even worse than C++ does, because C++
| doesn't optimize references as heavily as Rust does borrows)
| ltungv wrote:
| Yeah, it's a self-enforced contract that objects of type
| Gc<T> are only ever touched by the virtual machine, but I
| get your point.
| adrian17 wrote:
| The issue is that it's a simple footgun as soon as you
| start adding non-trivial native methods (say, string
| methods, array.map etc). The only way to make sure that
| they don't exist is removing the DerefMut impl entirely.
|
| It's not just that it's possible to break the code, but
| that lack of any checks makes it impossible to detect at
| either compile-time or runtime and have it "appear to
| work".
|
| One way to solve it to remove the DerefMut implementation
| and have it work more like Rc - as in, the user is forced
| to write `Gc<Cell<T>>` or `Gc<RefCell<T>>` if they want
| mutability. This solves the aliasing borrows issue at the
| cost of extra runtime checks with RefCell (and still
| doesn't prevent you from unsoundly calling `sweep()`
| while holding a Gc)
| ltungv wrote:
| I implemented exactly what you were saying here (https://
| github.com/ltungv/rox/commit/6a611e7acb3b36d0a3a4376...).
| But where's the fun in that?
| ekidd wrote:
| I would be really careful with those deref methods. They
| return references, not pointers, which means you need to
| follow the Rust rules:
|
| - You can have any number of simultaneous readers,
|
| - _Or_ one writer and no readers.
|
| - _But_ if you ever break these rules, the world may burn.
|
| Using unsafe and violating these rules is one of the cases
| where the Rust compiler can inflict worlds of misery:
| Incorrect code generation, CPUs seeing different versions of
| the same memory location, etc., depending on the exact
| context. Once you use "unsafe" and break the rules, Rust can
| be more dangerous than C. Rust even reserves the right to
| generate code using "noalias" (except doing so often triggers
| LLVM bugs, so it's usually turned off).
|
| "Undefined behavior" means "anything at all can happen, and
| some of it is deeply weird and awful and counterintuitive."
|
| You could enforce the borrowing rules at runtime by using
| std::cell::Cell in your heap objects, which is exactly what
| it exists for. Or you could package everything inside a tiny
| core module and audit it extremely carefully.
| ltungv wrote:
| I'm aware of all the issues mentioned. But for this
| particular project, I simply don't care as long as it
| passes Lox's test suite xD. I went this path just to see
| how easy it is to get tripped by unsafe while knowing that
| there's a technique to get safety with Pin<T> that this can
| get refactored into. I actually implemented this with Cell
| and RefCell but didn't find that interesting.
| slaymaker1907 wrote:
| You would probably want to use RefCell instead of Cell. It
| allows you to safely upgrade into a &mut using only a
| constant reference to RefCell, but it dynamically verifies
| that it's actually safe using ref counting. The ref
| counting also isn't too expensive since it isn't atomic.
| jcdavis wrote:
| Fun writeup. When I went through and implement the book in rust
| (https://github.com/jcdavis/rulox), I just used Rc and never
| really solved the cycle issue.
|
| I'll +1 and say I highly recommend going through Crafting
| Interpreters, particularly as a way of building non-trivial
| programs in languages you are less familiar with - If you just
| follow the java/C example you are tempted to lean into
| copy/pasting the samples, but figuring things out in other
| languages is a great experience.
|
| I spent a longass time tracking down an issue with how the
| reference interpreter implementation defines tokens:
| https://github.com/munificent/craftinginterpreters/issues/11...
| which was frustrating but good debugging practice.
| lawn wrote:
| > If you just follow the java/C example you are tempted to lean
| into copy/pasting the samples, but figuring things out in other
| languages is a great experience.
|
| I'll echo this recommendation.
|
| I haven't gone through Crafting Interpreters yet (it's on the
| bookshelf) but I did go through a big chunk of Building Git and
| instead of Ruby I did it in Rust, to get back into the language
| and go a little deeper.
|
| It was really great and it gave me a lot more than if I'd just
| copy paste the code as you say.
| ghosty141 wrote:
| My only issue with Crafting Interpreters is that it relies too
| much on writing code and then explaining it instead of getting
| the concepts across first and then providing the code. Another
| thing I disliked was some code was structured in a way that it
| would go well with concepts that came further down the
| implementation which can lead to confusion since it seems
| overcomplicated/overabstracted at first.
| grumpyprole wrote:
| My minor nit pick is that it uses mostly the Java language to
| implement a Java-like language. What have we gained? Surely
| something Python-like in C would be more rewarding, one would
| be gaining a level of abstraction.
| munificent wrote:
| _> Surely something Python-like in C would be more
| rewarding, one would be gaining a level of abstraction._
|
| That's the second half of the book.
| grumpyprole wrote:
| My apologies, that does indeed seem like a reasonable way
| of doing it.
| munificent wrote:
| One of the biggest challenges with writing is linearizing:
| deciding what order to present the material so that it makes
| the most sense.
|
| There's really no perfect solution. Some readers prefer
| bottom up where they are given individual pieces they can
| understand which then get composed. Others prefer top down
| where they are given the high level problem being solved in
| terms of intermediate steps which then get incrementally
| defined. Some want to see code first so that they have
| something concrete to hang concepts off of. Others want
| concepts first and then to see the code as an illustration of
| it.
|
| This is made even harder with Crafting Interpreters because
| the book's central conceit is that every line of code in the
| interpreters is shown in the book. There's nothing left to
| the reader. And, also, the book tries to give the user a
| program they can compile and run as often as possible as they
| work through things.
|
| I did the best I could, but there are definitely some places
| where it's really hard to figure out a good order. Sometimes
| the book explicitly says "sorry, this won't make sense but
| bear with me and we'll circle back".
|
| Also, at the macro structure, the entire book is organized
| into two parts where the first one is higher-level and more
| concept driven and then the second part circles all the way
| back to build the entire language from scratch again but at a
| much lower level of abstraction.
|
| I appreciate your feedback. It's a hard problem and a skill
| I'm always trying to improve.
| scott_s wrote:
| I _love_ the code-first approach. It was a revelation to me
| when I first read Mark Pilgrim 's Dive Into Python
| (https://diveintopython3.net/).
| Dowwie wrote:
| did you not want anyone to ever find rulox on github? it has no
| descriptive information about it in the repo
| jlewallen wrote:
| Crafting Interpreters is such an amazing work.
|
| There's at least one other Rust implementation of lox that I know
| of (https://github.com/tdp2110/crafting-interpreters-rs) (no
| unsafe)
|
| It's always interesting to see how different people approach the
| problems in their own language or relative isolation. I agree
| with others here, the real value of the original work lies in
| avoiding copy and paste.
| munificent wrote:
| There are a whole pile of Lox implementations in Rust (as well
| as many other lanugages):
|
| https://github.com/munificent/craftinginterpreters/wiki/Lox-...
| timsneath wrote:
| Lox must have the highest ratio of (implementations :
| production usage) of any language on the planet. And I mean
| that as the highest praise -- it's proven a fantastic
| teaching language, and your book is infectious at encouraging
| others to jump in and follow along in various different
| languages.
|
| I've also found the exercise of implementing Lox in another
| language as highly instructive in learning how to write
| idiomatic code in that language. I continue to learn more
| about the best way to express patterns as I work on my own
| implementation. I'd recommend the journey to any professional
| developer for this side-benefit alone.
| stevekemp wrote:
| > Lox must have the highest ratio of (implementations :
| > production usage) of any language on the planet.
|
| It's probably up there, for sure! But I'd guess that there
| are a million toy Lisp implementations, and more people are
| interested in writing a FORTH interpreter than actually
| using one in production. So I'd guess if we tried to get
| statistics it wouldn't be at the top.
|
| Though there's probably a similar claim to be made for the
| Monkey-language from Torsten Bell, via his books on
| compilers and interpreters.
|
| https://monkeylang.org/
| munificent wrote:
| _> Lox must have the highest ratio of (implementations :
| production usage) of any language on the planet._
|
| Maybe! It's definitely getting there. I suspect "semi-
| arbitrary subset of C" still has me beat but who knows for
| how much longer.
| leftyspook wrote:
| All praise Bob!
|
| On a more serious note, have you thought about trying to
| aim lightning at the same spot again and write another
| book about implementing something most programmers take
| for granted?
| munificent wrote:
| I've definitely thought about writing a third book. I
| don't know if it would be about "something most
| programmers take for granted". I'm more interested in
| writing about whatever happens to excite me the most at
| that time.
| diffxx wrote:
| I'd imagine cool is up there as well:
| https://nguyenthanhvuh.github.io/class-compilers/cool.html
| samsartor wrote:
| I have run into a lot of similar problems writing a state
| management framework for Rust (wip at
| https://gitlab.com/samsartor/hornpipe) and at one point despaired
| that I would have to abandon Rust and make a whole new reactive
| language. But the last couple years I've got it working really
| nicely with weak references and software transactional memory.
| Every reference is `Copy + Send + Sync + 'static`. And you can
| mutate objects by having a transaction make a local bitwise copy,
| which will get atomically merged and swapped on commit. The old
| copy gets kept around for undo/redo purposes. I've still got a
| boatload of algorithmic puzzles to solve to provide all the MVP
| features, which will take a while because it isn't my full-time
| job. But the details seem technically sound.
|
| I did write a blog post about some of my design thoughts,
| although it doesn't dig into the technical guts:
| https://samsartor.com/guis-3/
| ceronman wrote:
| Nice article. A couple of years ago I also implemented Lox in
| Rust. And I faced the exact same issues that the author describes
| here, and I also ended up with a very similar implementation.
|
| I also wrote a post about it: https://ceronman.com/2021/07/22/my-
| experience-crafting-an-in...
|
| I ended up having two implementations: One in purely safe Rust
| and another one with unsafe.
|
| Note that if you're interesting in the "object manager" approach
| mentioned, I did that in my safe implementation, you can take a
| look at https://github.com/ceronman/loxido
| ltungv wrote:
| I'd read your article, and it was lovely. It nudged me to just
| go unsafe and implement some of the data structures from
| scratch.
| Daffodils wrote:
| Just wanted to share that the Build your own Interpreters
| challenge on CodeCrafters is based on the Crafting Interpreters
| book and is free at the moment.
|
| Link: https://app.codecrafters.io/courses/interpreter/overview
|
| (Disclaimer: I work for CodeCrafters, and built this challenge).
| bombela wrote:
| > How much memory is used for the stack and when it is freed can
| always be determined at compile-time.
|
| Technically, not always. You can allocate on the stack at
| runtime, see "alloca" in C.
|
| But is alloca even that useful in practice? I struggle to find a
| good example.
| tick_tock_tick wrote:
| I mean it's basically the fastest allocation possible. It's
| amazing under a very very unique situation.
|
| - you need a lot of memory
|
| - it needs to be as fast as possible
|
| - it's a leaf function that can never call another function
|
| 99.9% of the time those aren't all true.
| alexchamberlain wrote:
| Why does it have to be a leaf function?
| jfoutz wrote:
| So the massive stack grab isn't permanent.
|
| Conceptually, if the compiler could just inline all of the
| child calls, it's fine. But if this call won't return for a
| long time, or may be called again (even worse). You're
| setting aside stack that you'll never get back. Which is a
| memory leak, and your program will crash.
| alexchamberlain wrote:
| That reasoning leads me to conclude the function should
| be short lived or the memory in use by all child
| functions, rather than it has to be a leaf function?
| jfoutz wrote:
| yeah. Leaf is a good rule of thumb, but if you need a
| little helper to do something it's generally fine.
|
| you really do need confidence the little helper can be
| fully inlined. if the helper is out of your control, and
| someone greatly expands its responsibility, it can suck.
|
| So, yeah, effectively a leaf.
| bombela wrote:
| The stack grab is as permanent as any other stack
| variable. There is nothing special about it. The only
| difference is that you can decide at runtime how much to
| add to the current stack frame.
| mypalmike wrote:
| If the memory's lifetime correctly coincides with the
| function (which is why you might use alloca in the first
| place), I don't see how this would be a memory leak nor
| why it would lead to a crash. Maybe on systems which
| limit stack size...?
| slaymaker1907 wrote:
| There's another reason: we need to do string stuff and we
| absolutely can't stall. Malloc always has some chance of
| stalling due to heap fragmentation. The leaf function
| requirement is also not necessary. You just have to make sure
| you don't alloca too much memory given how much is remaining
| on the stack and roughly how much memory you know subsequent
| calls will take.
|
| It's not all that much worse than using a huge, static stack
| allocation. In fact, there are cases where alloca is safer
| since you can take into account how much stack memory you
| already have available before incrementing the stack pointer.
| ltungv wrote:
| Oh right. I read about this once or twice and have never looked
| at it again. I've made the (wrong) assumption that this
| function needs some compile-time help. Is it all done at
| runtime? Or does that depend on the actual C implementation?
| bombela wrote:
| Stack allocation is always done at runtime. Baring special
| optimizations, the CPU and the code work together to (stack
| grows downward) decrement the CPU stack pointer register.
| This register keeps track of where the bottom of the stack
| is.
|
| What is done at compile time by the compiler, is calculating
| the amount to decrement the CPU stack pointer register at
| runtime for every stack variables.
|
| The compiler and alloca must work together, because
| optimization can easily break something here. For example,
| when a function return, the stack pointer register better be
| restored properly. Furthermore, in Rust you must not forget
| to call Drop::drop() for all types that have such a
| destructor.
| ozgrakkurt wrote:
| This was used in linux kernel and they were removing it later
| because it was a huge pain as far as I can remember. Seems like
| it causes problems because entire system is built assuming
| stack/heap work in a certain way.
| kaspermarstal wrote:
| > Don't ask me about covariance since I honestly don't know the
| answer. I recommend reading the section on subtyping and variance
| and The Rustonomicon. I went through them a few times but still
| don't entirely get all the details. Then, we can define each type
| of object like so:
|
| Haha this is gold. We ALL do this ALL the time.
| lowbloodsugar wrote:
| TL;DR: Don't use unsafe to break the rules. Use unsafe to enforce
| the rules in new ways.
|
| Great journey. That's the thing about "unsafe" and "questionable
| parts"... if they really are questionable, then don't do 'em. In
| this case, if it is the case that holding a reference to a GC
| object guarantees that that object can't be freed, then it's not
| questionable. Proving that to be case can be fun tho.
|
| The question is: does my unsafe code allow violating the borrow
| rules?
|
| Cell, RefCell, RwLock etc give you interior mutability, but there
| is no way to violate the borrow rules with them. On the surface,
| it looks like you are violating them because "I have a non-mut
| thing that I can mutate".
|
| If you build a thing that allows two distinct &mut T's to the
| same thing, then you are basically asking for a bug. Don't use
| unsafe to break the rules. Use unsafe to enforce the rules in new
| ways.
| ltungv wrote:
| If someone else depends on this project, I'd definitely not
| implement the questionable stuff. You're right that proving
| whether it's safe can be lots of fun, and I'm planning to try
| it out.
|
| Based on what I've read, it's entirely possible. If my Rust
| knowledge is correct, there are 2 things that have to be proven
| - the object's lifetime and the borrow rules.
|
| Proving the lifetime can be done at compile-time based on what
| was discussed in these articles
| https://without.boats/tags/shifgrethor/. But that still leaves
| us with proving that we don't violate the borrow rules. AFAIK,
| the only way to do it in my implementation is with Cell and
| RefCell enforcing the borrow rules at runtime.
|
| Before removing all the safety checks, I actually used
| Gc<RefCell<T>> for mutable objects, so the borrow rules were
| enforced but not the lifetime. However, I decided to remove
| RefCell<T> to have some fun and see how I could shoot myself in
| the foot.
| adrian17 wrote:
| Just for reference, the gc-arena crate
| (https://github.com/kyren/gc-arena) we use in Ruffle's
| Actionscript 1/2/3 interpreters and across the entire engine,
| does this. As far as we know, it's safe and sound, at the
| cost of some limitations, mainly: to compile-time prevent
| sweeping while holding onto Gc references, every struct and
| function touching Gc objects passes an extra 'gc lifetime.
| And not being able to run the GC at arbitrary time can be
| major issue depending on what you're doing; in Ruffle we
| incrementally collect between frames and pray that no single
| frame will ever allocate enough to OOM in the middle of it :)
|
| And yeah, RefCells everywhere.
| scott_s wrote:
| In case the author reads this: please explicitly cite all of
| Nystrom's figures. A link is not enough.
|
| Even with a citation, I'm not quite comfortable just reusing
| someone else's figures so many times when they're doing so much
| heavy lifting. But an explicit citation is the minimum.
| imachine1980_ wrote:
| idk if he/she change it, but i see the name big and center
| after each use.
| ltungv wrote:
| Thanks for the suggestion. I updated it with more explicit
| citations. After seeing your comment, I just looked around and
| saw that no license was given for the images.
|
| I should probably draw my own.
| nestorD wrote:
| Note that there is quite a bit of thinking and available
| implementations of GC implementations available in Rust:
| https://manishearth.github.io/blog/2021/04/05/a-tour-of-safe...
| ajeetdsouza wrote:
| This is really well-written!
|
| Shameless plug: you may want to check out Loxcraft:
| https://github.com/ajeetdsouza/loxcraft
|
| I too followed the path of "ignore safe Rust for maximum
| performance". It got pretty close to the C version, even beating
| it on some benchmarks.
___________________________________________________________________
(page generated 2024-07-31 23:00 UTC)