[HN Gopher] Mugo, a toy compiler for a subset of Go that can com...
       ___________________________________________________________________
        
       Mugo, a toy compiler for a subset of Go that can compile itself
        
       Author : benhoyt
       Score  : 155 points
       Date   : 2021-04-12 10:01 UTC (12 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | chrfrasco wrote:
       | This is so cool. I especially loved the Makefile, it's nice to
       | see "bootstrapping" laid out so plainly. I remember finding the
       | concept quite hard to wrap my head around when I was introduced
       | to it. Seeing this would have made it a lot easier to understand!
        
         | chrfrasco wrote:
         | > I wonder if we could reduce binary size with a dead-simple
         | scheduler and GC
         | 
         | For many CLIs I think even a brain-dead bump allocator would
         | work
        
           | Hello71 wrote:
           | musl automatically uses a bump allocator if free is not
           | referenced in the program.
        
             | KMag wrote:
             | That's really cool!
             | 
             | Though, I'm curious... don't a lot of malloc
             | implementations use a bump allocator if a simple
             | fragmentation heuristic is below some limit? Presumably
             | musl down inside malloc() has a static bool (or a static
             | function pointer) it uses to keep track if dlsym() has ever
             | returned the address of free(). How much faster is the musl
             | implementation than an implementation using a simple
             | fragmentation heuristic? Presumably, they're both well-
             | predicted conditional branches (and/or indirect branch
             | targets well predicted by the BTB).
        
             | benhoyt wrote:
             | Interesting. I couldn't find any documentation for how this
             | happens (does it need compiler/linker support to know
             | whether free is used?), but I did find the source code for
             | this __simple_malloc: https://git.musl-
             | libc.org/cgit/musl/tree/src/malloc/lite_mal...
        
           | tetha wrote:
            | A programming language professor working on interpreters once
            | said: for short-lived processes, the most effective and
            | efficient garbage collection might be... not collecting at
            | all and just terminating the process. The OS will reclaim all
            | the garbage at once, very quickly. So why spend time being
            | smart?
        
         | dane-pgp wrote:
         | If you like well-documented bootstrap processes, I can
         | recommend this:
         | 
         | https://github.com/fosslinux/live-bootstrap/blob/master/part...
         | 
         | The steps go from a binary seed of a few hundred bytes to
         | (almost) gcc. It is still being actively developed as part of
         | the Bootstrappable Builds project:
         | 
         | https://www.bootstrappable.org/
        
       | jeffrallen wrote:
       | This is excellent! And a great way to understand what's the
       | simplest a computer and a programming language can be. Students
       | should puzzle their way through exercises built on systems like
       | this.
        
       | dvfjsdhgfv wrote:
       | Being able to compile itself is actually a huge milestone,
       | congratulations!
        
       | jhgb wrote:
       | > whereas mine is 1600 lines of formatted Go.
       | 
       | I wonder how the whole thing compares with the newest Oberon
       | (which should be at around 2900 lines of code or so). After all,
       | if you drink enough beers, Oberon might look like "a subset of Go
       | that can compile itself" to you.
        
         | benhoyt wrote:
         | Wow, I had no idea the Oberon implementation was so small,
         | thanks. And it looks like it's implemented in a similar way,
         | with parsing and code generation mingled together. It is ~2900
         | lines of code, though it's very dense. You can see it here:
          | http://www.projectoberon.com/
          | 
          |       ORB - symbol table handling - 430 lines
          |       ORG - code generator (for RISC) - 1115 lines
          |       ORP - parser (calls scanner and code generator) - 997 lines
          |       ORS - scanner/lexer - 312 lines
        
       | iovrthoughtthis wrote:
       | great post, this part
       | 
       | > For reference, a Python version of this same loop runs in 1
       | minute and 38 seconds ... a dynamically-typed bytecode
       | interpreter is not a good choice for heavy integer arithmetic.
       | 
       | made me wonder about other dynamic langs.
       | 
       | ruby (3.0.0) and lua (5.4.2) ran it in 1:30 and 1:03 respectively
       | but node (12.22.0) ran it in 0:01. how is node so quick?
        
         | elcomet wrote:
         | Node uses a JIT compiler (V8) and is not interpreted. This is
         | why it's so fast.
         | 
         | You could compare it with PyPy, a just-in-time compiler for
         | python.
         | 
         | For example, on my machine, the same program runs in 2:12 min
         | with the regular python interpreter, and in 1.57 seconds with
         | pypy.
        
         | inbx0 wrote:
         | Well, for one thing, if you copied the numbers from the example
         | as they are defined, then JS will be dealing with simple
         | floating point numbers and actually gets the answer wrong (but
         | gets there fast!). The answer is beyond 5E+17 but JS's
         | Number.MAX_SAFE_INTEGER is only 9E+15. The results start
         | getting weird after that.
         | 
          | To get the correct answer, we need to convert some of the
          | numbers to BigInts:
          | 
          |       var result = 0n
          |       function main() {
          |         var sum = 0n
          |         var i = 1
          |         while (i <= 1000000000) {
          |           sum = sum + BigInt(i)
          |           i = i + 1
          |         }
          |         result = sum
          |       }
          |       main()
          | 
          | but after that the result is not quite so impressive anymore:
          | 
          |       Now using node v15.6.0 (npm v7.4.0)
          |       time node index.js
          |       node index.js  51.50s user 0.40s system 103% cpu 50.072 total
        
           | zeugmasyllepsis wrote:
           | Thank you for providing details and not just assuming the JIT
           | was the cause of the difference.
        
         | josefx wrote:
          | Ruby has its JIT disabled by default, as far as I can tell.^1
         | Lua exists in both jit compiled and interpreted forms. Python
         | definitely is interpreted unless you run it through pypy.
         | JavaScript is basically the only language in that lineup that
         | enables just in time compilation with most of the "default"
         | runtimes.
         | 
         | ^1 Most likely because it generates C code and has to call an
         | external C compiler to get things done.
        
         | deweller wrote:
         | For reference, this runs in 13s in PHP 8 on my ARM Mac. PHP 8
         | uses a JIT compiler.
        
         | thysultan wrote:
          | NodeJs and Go can, and probably are, optimizing the loop away;
          | the author's attempt to assign the sum to result does not
          | prevent this from happening.
        
           | benhoyt wrote:
           | > can and probably are optimizing the loop away
           | 
           | Why do you think that, and how would it do that and still do
           | the calculation? My timing tests showed that Go didn't
           | optimize the loop away (if I counted to 10 billion it took
           | about 10 times as long), and godbolt.org also shows the loop
           | in Go's assembly output: https://www.godbolt.org/z/d38xfGYr6
        
             | makapuf wrote:
              | Fun fact: in C, gcc and clang can (and will) replace the
              | sum of the first n integers computed with a loop (and
              | slight variations) with the closed form n(n+1)/2.
              | Impressive.
        
               | benhoyt wrote:
               | Wow, that is impressive, thanks. I confirmed this with
               | godbolt.org at -O2.
        
         | Joker_vD wrote:
         | Erlang/OTP 22 (on my machine(tm), of course) runs it in 9.3
         | seconds when invoked from the compiled module, and in at least
         | in 5 minutes when invoked as a fun defined in the shell: "at
         | least" because I did not have enough patience to wait until it
         | actually completes.
        
         | jerf wrote:
         | If you think about what something like CPython is doing, you
         | are maximally pessimizing the way the implementation works by
         | adding integers together. You get all the dynamic type
         | manipulation and management that is occurring all the time, but
         | then when it comes time to do the 'payload' of the operation it
         | is doing, it's a simple ADD operation that can complete in
         | (amortized) less than one cycle. It's the worst case for
         | something like CPython. All that work a dynamic language is
         | doing has much better performance characteristics when you get
         | to a "payload" that is like running C code to find a substring
         | match in a string, or something else non-trivial that helps to
         | amortize all the bookkeeping with more real work.
         | 
         | As for why Node runs quickly, adding integers (or floats or
         | whatever) is easy mode for a JIT. This is basically why you see
         | JIT numbers like speeding up the original implementation by
         | factors of literally over 100x, yet you can't expect that
         | multiplier against "real code", for essentially the same reason
         | only now on the other end; the JIT can very successfully cut
          | out all that work CPython or similar implementations are doing
         | if you have completely stable types that it can infer
         | (especially something as simple as a single int or float), but
         | now the "payload" is _preventing_ the JIT from posting big
         | numbers in improvement when you have more complicated
          | operations because the JIT isn't going to be able to improve
         | something like a substring search. If you get something like
         | NumPy where you can feasibly write a program that will spend
         | 99+% of its time in highly optimized matrix code, a JIT can't
         | help much with that, because even if it reduced the bookkeeping
         | to zero, you've still got to do all that matrix work.
         | 
         | Statically-typed compiled languages basically bypass the whole
          | "do lots of expensive bookkeeping to permit very flexible coding
         | styles, then write a JIT that figures out how to bypass that
         | expensive bookkeeping as often as possible" to simply constrain
         | you to write in the fast subset in the first place.
        
         | sfifs wrote:
         | Years of Google engineers squeezing performance for JavaScript
         | heavy web pages maybe?
        
           | iovrthoughtthis wrote:
           | That is certainly how, but why? What have they done that
           | enables it to do that?
        
             | pjmlp wrote:
             | Here, have some fun,
             | 
             | "V8 Internals For JavaScript Developers"
             | 
             | https://vimeo.com/312113316
             | 
             | https://v8.dev/blog
        
             | cdogl wrote:
             | Chrome's V8 engine made quite a stir around the turn of the
              | last decade because it compiled JavaScript down to
             | native code. It would even watch how code was called (if it
             | couldn't statically figure out a safe way to compile it)
             | and generate native code for the happy paths in hot code
             | paths. I simply assume that all browsers do this now
             | because we take it for granted now that JS is reasonably
             | fast, but I could be wrong.
             | 
             | I credit much of Chrome's success to V8. It's a marvel of
             | engineering. Shame that the rest of Chrome hasn't been held
             | to such a high standard.
        
               | josefx wrote:
               | > I simply assume that all browsers do this now because
               | we take it for granted now that JS is reasonably fast,
               | but I could be wrong.
               | 
               | Chrome came out a month after Mozilla published its
               | TraceMonkey JIT. So JITing JavaScript has been the
               | standard thing to do since the end of 2008.
        
               | KMag wrote:
               | ... and TraceMonkey was based on open-sourced
               | Macromedia/Adobe code from the ActionScript 3 VM. AS3 was
               | loosely based on a draft of the ill-fated ES 4, but I'm
               | not sure if it was a strict superset of an earlier
               | ECMAScript standard. It's possible that the Flash AS3 VM
               | had been JITing all of ES2 or ES3 earlier than
               | TraceMonkey came out.
        
       | [deleted]
        
       | dimes wrote:
       | That's amazing! Congrats. One question: why is var only supported
       | for top level declarations? I'm not a compiler or programming
       | language expert by any means, but I thought the general approach
       | was to define a grammar and write a parser for it. I'm curious
       | how global and local declarations diverged so significantly.
        
         | dahfizz wrote:
         | This is a common simplification in compilers. Even C, until
         | 1999, forced you to declare your variables at the top of the
         | function before any code.
         | 
          | It's not a limitation of the parser, but simply that having
          | variable declarations anywhere makes the compiler more complex.
          | It's harder to calculate stack space / stack pointers when the
          | variables are declared all over the place.
        
           | Joker_vD wrote:
           | To clarify why the stack usage calculation is hard when local
           | variable declarations are allowed everywhere, consider
            | compiling this example code:
            | 
            |       int a;
            |       if (...) {
            |           int b;
            |           int c;
            |           if (...) {
            |               int d;
            |           } else {
            |               int e;
            |               int f;
            |           }
            |           int g;
            |       } else {
            |           int h;
            |       }
            |       int i;
           | 
           | Let's start assigning offsets to the variables, starting at
           | offset 0. The "a" variable gets 0, "b" gets 1, "c" gets 2,
           | "d" gets 3, "e" gets 4, "f" gets 5, "g" gets 6, "h" gets 7,
           | and "i" gets offset 8, so we need to allocate 9 * sizeof(int)
           | bytes on the stack for the local variables.
           | 
           | Wait, there is no complication at all, so what's the problem?
            | Well, the problem is suboptimal stack usage: at any point,
           | only 5 ints at most are alive, so you can alias "b" with "h"
           | and "i", and alias "d" with "e" and "g", and save 4 *
           | sizeof(int) bytes of the stack. But to do this you need to
           | track the nesting levels (so you could reset the current
           | allocated offset back to the one used at the start of the
           | block) -- and if you're writing a one-pass recursive-descent
           | compiler, you get this for (almost) free anyway... huh.
           | 
           | Did I miss something? It seems there is actually no
           | complication at all, just unwillingness to implement such a
           | feature.
        
             | Someone wrote:
              | You 'have' to go back to the start of your function to fill
              | in the correct size of the stack frame. (An alternative is
              | to make a stack frame for every variable or basic block,
              | but that's a waste of cycles and code, especially on early
              | machines.)
             | 
             | Again, the solution to that shouldn't stop you, as it is
             | similar to how you handle filling in the branch offsets in
             | forward branches in if/then/else, and that's something you
             | have to do, anyways (yes, you can stream out your object
             | code without going back and filling in branch offsets but
             | that leads to slow, ugly code)
             | 
             | I think the reason this _was_ done in old languages is that
             | parsing and compiling were black arts when those languages
             | were created. COBOL and FORTRAN didn't use stack frames,
             | Algol invented per-function stack frames and later loosened
             | that to per-block ones. That's what K&R C did, too.
             | 
              | I don't know which language invented the idea that you
              | could loosen that further, but it might have been C++.
        
             | [deleted]
        
             | yencabulator wrote:
             | Now do that with the rest of the compilation in one pass,
             | without an AST.
        
               | Someone wrote:
                | If you target a simple CPU (no out-of-order execution, no
                | pipelines), and are ignoring performance, it's not that
                | difficult. Early C compilers did ignore performance: if
                | you wanted speed, you inserted 'register' in the right
                | places (and you knew what the right places were because
                | you wrote the compiler or were sitting next to the guy
                | who did), or you dropped down to assembly.
               | 
                | The tricky bits are the stack offsets to local variables.
                | If you compile
                | 
                |       int a = foo();
                |       int b = a + 1;
                | 
                | to
                | 
                |       call foo
                |       push
                |       add 1
                |       push
               | 
               | the addresses of variables relative to the stack pointer
               | change when you create new locals, when locals go out of
               | scope, or when you push function arguments onto the
               | stack. The trick is for the compiler to track the address
               | of locals relative to the top of the stack frame and keep
               | track of the depth of the stack frame. Then, when
               | generating code to access a stack-relative location,
               | combine the two to translate 'relative to top of stack
               | frame' to 'relative to stack pointer'. A bit tedious, but
               | doable.
               | 
               | Now, when you want performance (especially on modern
               | CPUs), and don't want to use stack space for variables
               | kept in registers, things do get significantly more
               | complicated.
        
           | makapuf wrote:
            | But here it's only the "var" syntax that is not allowed; the
            | newvar := someexpression() form is allowed, having the same
            | effect of declaring a new local variable, if I understand
            | correctly? Edit: typo
        
             | yencabulator wrote:
             | I think it's just a case of keeping it simple. Basically,
             | it looks like Mugo supports what Mugo consumes (and about
             | two simple additions). Adding local `var` parsing likely
             | would complicate the hand-crafted parser.
        
               | benhoyt wrote:
               | Yes, that's correct. I actually initially implemented
               | local "var" parsing, but was never using it in the
               | compiler, so I removed it to trim it down and keep things
               | as simple as reasonably possible.
        
       | userbinator wrote:
       | _Fabrice's C compiler is implemented in 2048 bytes of obfuscated
       | C, whereas mine is 1600 lines of formatted Go._
       | 
       | "It can compile itself" is an interesting benchmark of language
       | complexity, and we now have of C and Go, so I wonder how the
       | other common languages would compare. Of course you can "cheat"
       | by choosing a subset that reduces the complexity of the whole.
       | 
       |  _As you can see, Mugo uses its own very inefficient ABI that's
       | nothing like the x86-64 ABI - the standard ABI puts the first six
       | "cells" (64-bit values) in registers._
       | 
        | Considering that stacking args is the norm for 32-bit x86 and a
        | few other architectures, it's not that bad, especially since CPUs
        | have special hardware to handle push/pop.
        
         | pjmlp wrote:
          | Most compiled languages are bootstrapped, how come "we now have
          | of C and Go"?
        
           | userbinator wrote:
           | Can you point to any other languages' examples of minimal
           | self-compiling compilers?
        
             | eat_veggies wrote:
             | Here's a (probably non exhaustive) list I found
             | 
             | https://en.wikipedia.org/wiki/Self-
             | hosting_(compilers)#List_...
        
       ___________________________________________________________________
       (page generated 2021-04-12 23:01 UTC)