[HN Gopher] Mugo, a toy compiler for a subset of Go that can com...
___________________________________________________________________
Mugo, a toy compiler for a subset of Go that can compile itself
Author : benhoyt
Score : 155 points
Date : 2021-04-12 10:01 UTC (12 hours ago)
(HTM) web link (benhoyt.com)
(TXT) w3m dump (benhoyt.com)
| chrfrasco wrote:
| This is so cool. I especially loved the Makefile, it's nice to
| see "bootstrapping" laid out so plainly. I remember finding the
| concept quite hard to wrap my head around when I was introduced
| to it. Seeing this would have made it a lot easier to understand!
| chrfrasco wrote:
| > I wonder if we could reduce binary size with a dead-simple
| scheduler and GC
|
| For many CLIs I think even a brain-dead bump allocator would
| work
| Hello71 wrote:
| musl automatically uses a bump allocator if free is not
| referenced in the program.
| KMag wrote:
| That's really cool!
|
| Though, I'm curious... don't a lot of malloc
| implementations use a bump allocator if a simple
| fragmentation heuristic is below some limit? Presumably
| musl down inside malloc() has a static bool (or a static
| function pointer) it uses to keep track of whether dlsym() has
| ever returned the address of free(). How much faster is the musl
| implementation than an implementation using a simple
| fragmentation heuristic? Presumably, they're both well-
| predicted conditional branches (and/or indirect branch
| targets well predicted by the BTB).
| benhoyt wrote:
| Interesting. I couldn't find any documentation for how this
| happens (does it need compiler/linker support to know
| whether free is used?), but I did find the source code for
| this __simple_malloc: https://git.musl-
| libc.org/cgit/musl/tree/src/malloc/lite_mal...
| tetha wrote:
| A programming language professor working on interpreters once
| said: for short-lived processes, the most effective and
| efficient garbage collection might be... not collecting at all
| and just terminating the process. The OS will reclaim all of
| the memory at once, very quickly. So why spend time being
| smart?
| dane-pgp wrote:
| If you like well-documented bootstrap processes, I can
| recommend this:
|
| https://github.com/fosslinux/live-bootstrap/blob/master/part...
|
| The steps go from a binary seed of a few hundred bytes to
| (almost) gcc. It is still being actively developed as part of
| the Bootstrappable Builds project:
|
| https://www.bootstrappable.org/
| jeffrallen wrote:
| This is excellent! And a great way to understand just how
| simple a computer and a programming language can be. Students
| should puzzle their way through exercises built on systems like
| this.
| dvfjsdhgfv wrote:
| Being able to compile itself is actually a huge milestone,
| congratulations!
| jhgb wrote:
| > whereas mine is 1600 lines of formatted Go.
|
| I wonder how the whole thing compares with the newest Oberon
| (which should be at around 2900 lines of code or so). After all,
| if you drink enough beers, Oberon might look like "a subset of Go
| that can compile itself" to you.
| benhoyt wrote:
| Wow, I had no idea the Oberon implementation was so small,
| thanks. And it looks like it's implemented in a similar way,
| with parsing and code generation mingled together. It is ~2900
| lines of code, though it's very dense. You can see it here:
| http://www.projectoberon.com/
|
|   ORB - symbol table handling - 430 lines
|   ORG - code generator (for RISC) - 1115 lines
|   ORP - parser (calls scanner and code generator) - 997 lines
|   ORS - scanner/lexer - 312 lines
| iovrthoughtthis wrote:
| great post, this part
|
| > For reference, a Python version of this same loop runs in 1
| minute and 38 seconds ... a dynamically-typed bytecode
| interpreter is not a good choice for heavy integer arithmetic.
|
| made me wonder about other dynamic langs.
|
| ruby (3.0.0) and lua (5.4.2) ran it in 1:30 and 1:03 respectively
| but node (12.22.0) ran it in 0:01. how is node so quick?
| elcomet wrote:
| Node uses a JIT compiler (V8) and is not interpreted. This is
| why it's so fast.
|
| You could compare it with PyPy, a just-in-time compiler for
| python.
|
| For example, on my machine, the same program runs in 2:12 min
| with the regular python interpreter, and in 1.57 seconds with
| pypy.
| inbx0 wrote:
| Well, for one thing, if you copied the numbers from the example
| as they are defined, then JS will be dealing with simple
| floating point numbers and actually gets the answer wrong (but
| gets there fast!). The answer is beyond 5E+17 but JS's
| Number.MAX_SAFE_INTEGER is only 9E+15. The results start
| getting weird after that.
|
| To get the correct answer, we need to convert some of the
| numbers to BigInts:
|
|   var result = 0n
|   function main() {
|     var sum = 0n
|     var i = 1
|     while (i <= 1000000000) {
|       sum = sum + BigInt(i)
|       i = i + 1
|     }
|     result = sum
|   }
|   main()
|
| but after that the result is not quite so impressive anymore:
|
|   Now using node v15.6.0 (npm v7.4.0)
|   time node index.js
|   node index.js  51.50s user 0.40s system 103% cpu 50.072 total
| zeugmasyllepsis wrote:
| Thank you for providing details and not just assuming the JIT
| was the cause of the difference.
| josefx wrote:
| Ruby has its JIT disabled by default, as far as I can tell^1 .
| Lua exists in both JIT-compiled and interpreted forms. Python
| definitely is interpreted unless you run it through PyPy.
| JavaScript is basically the only language in that lineup that
| enables just-in-time compilation in most of the "default"
| runtimes.
|
| ^1 Most likely because it generates C code and has to call an
| external C compiler to get things done.
| deweller wrote:
| For reference, this runs in 13s in PHP 8 on my ARM Mac. PHP 8
| uses a JIT compiler.
| thysultan wrote:
| Node.js and Go can, and probably are, optimizing the loop away;
| the author's attempt to assign the sum to result does not
| prevent this from happening.
| benhoyt wrote:
| > can and probably are optimizing the loop away
|
| Why do you think that, and how would it do that and still do
| the calculation? My timing tests showed that Go didn't
| optimize the loop away (if I counted to 10 billion it took
| about 10 times as long), and godbolt.org also shows the loop
| in Go's assembly output: https://www.godbolt.org/z/d38xfGYr6
| makapuf wrote:
| Fun fact: in C, gcc and clang can (and will) replace a loop
| summing the first n integers (and slight variations of it) with
| the closed-form formula n(n+1)/2. Impressive.
| benhoyt wrote:
| Wow, that is impressive, thanks. I confirmed this with
| godbolt.org at -O2.
| Joker_vD wrote:
| Erlang/OTP 22 (on my machine(tm), of course) runs it in 9.3
| seconds when invoked from the compiled module, and in at least
| 5 minutes when invoked as a fun defined in the shell: "at
| least" because I did not have enough patience to wait until it
| actually completed.
| jerf wrote:
| If you think about what something like CPython is doing, you
| are maximally pessimizing the way the implementation works by
| adding integers together. You get all the dynamic type
| manipulation and management that is occurring all the time, but
| then when it comes time to do the 'payload' of the operation it
| is doing, it's a simple ADD operation that can complete in
| (amortized) less than one cycle. It's the worst case for
| something like CPython. All that work a dynamic language is
| doing has much better performance characteristics when you get
| to a "payload" that is like running C code to find a substring
| match in a string, or something else non-trivial that helps to
| amortize all the bookkeeping with more real work.
|
| As for why Node runs quickly, adding integers (or floats or
| whatever) is easy mode for a JIT. This is basically why you see
| JIT numbers like speeding up the original implementation by
| factors of literally over 100x, yet you can't expect that
| multiplier against "real code", for essentially the same reason
| only now on the other end; the JIT can very successfully cut
| out all that work CPython or similar implementations are doing
| if you have completely stable types that it can infer
| (especially something as simple as a single int or float), but
| now the "payload" is _preventing_ the JIT from posting big
| numbers in improvement when you have more complicated
| operations because the JIT isn't going to be able to improve
| something like a substring search. If you get something like
| NumPy where you can feasibly write a program that will spend
| 99+% of its time in highly optimized matrix code, a JIT can't
| help much with that, because even if it reduced the bookkeeping
| to zero, you've still got to do all that matrix work.
|
| Statically-typed compiled languages basically bypass the whole
| "do lots of expensive bookkeeping to permit very flexible coding
| styles, then write a JIT that figures out how to bypass that
| expensive bookkeeping as often as possible" dance by simply
| constraining you to write in the fast subset in the first place.
| sfifs wrote:
| Years of Google engineers squeezing performance for JavaScript
| heavy web pages maybe?
| iovrthoughtthis wrote:
| That is certainly how, but why? What have they done that
| enables it to do that?
| pjmlp wrote:
| Here, have some fun,
|
| "V8 Internals For JavaScript Developers"
|
| https://vimeo.com/312113316
|
| https://v8.dev/blog
| cdogl wrote:
| Chrome's V8 engine made quite a stir around the turn of the
| last decade because it compiled JavaScript down to
| native code. It would even watch how code was called (if it
| couldn't statically figure out a safe way to compile it)
| and generate native code for the happy paths in hot code
| paths. I simply assume that all browsers do this now
| because we take it for granted now that JS is reasonably
| fast, but I could be wrong.
|
| I credit much of Chrome's success to V8. It's a marvel of
| engineering. Shame that the rest of Chrome hasn't been held
| to such a high standard.
| josefx wrote:
| > I simply assume that all browsers do this now because
| we take it for granted now that JS is reasonably fast,
| but I could be wrong.
|
| Chrome came out a month after Mozilla published its
| TraceMonkey JIT. So JITing JavaScript has been the
| standard thing to do since the end of 2008.
| KMag wrote:
| ... and TraceMonkey was based on open-sourced
| Macromedia/Adobe code from the ActionScript 3 VM. AS3 was
| loosely based on a draft of the ill-fated ES 4, but I'm
| not sure if it was a strict superset of an earlier
| ECMAScript standard. It's possible that the Flash AS3 VM
| had been JITing all of ES2 or ES3 earlier than
| TraceMonkey came out.
| [deleted]
| dimes wrote:
| That's amazing! Congrats. One question: why is var only supported
| for top level declarations? I'm not a compiler or programming
| language expert by any means, but I thought the general approach
| was to define a grammar and write a parser for it. I'm curious
| how global and local declarations diverged so significantly.
| dahfizz wrote:
| This is a common simplification in compilers. Even C, until
| 1999, forced you to declare your variables at the top of the
| function before any code.
|
| It's not a limitation of the parser, but simply that allowing
| variable declarations anywhere makes the compiler more complex.
| It's harder to calculate stack space / stack pointers when
| variables are declared all over the place.
| Joker_vD wrote:
| To clarify why the stack usage calculation is hard when local
| variable declarations are allowed everywhere, consider
| compiling this example code:
|
|   int a;
|   if (...) {
|       int b;
|       int c;
|       if (...) {
|           int d;
|       } else {
|           int e;
|           int f;
|       }
|       int g;
|   } else {
|       int h;
|   }
|   int i;
|
| Let's start assigning offsets to the variables, starting at
| offset 0. The "a" variable gets 0, "b" gets 1, "c" gets 2,
| "d" gets 3, "e" gets 4, "f" gets 5, "g" gets 6, "h" gets 7,
| and "i" gets offset 8, so we need to allocate 9 * sizeof(int)
| bytes on the stack for the local variables.
|
| Wait, there is no complication at all, so what's the problem?
| Well, the problem is suboptimal stack usage: at any point,
| only 5 ints at most are alive, so you can alias "b" with "h"
| and "i", and alias "d" with "e" and "g", and save 4 *
| sizeof(int) bytes of the stack. But to do this you need to
| track the nesting levels (so you could reset the current
| allocated offset back to the one used at the start of the
| block) -- and if you're writing a one-pass recursive-descent
| compiler, you get this for (almost) free anyway... huh.
|
| Did I miss something? It seems there is actually no
| complication at all, just unwillingness to implement such a
| feature.
| Someone wrote:
| You 'have' to go back to the start of your function to fill in
| the correct size of the stack frame. (An alternative is to make
| a stack frame for every variable or basic block, but that's a
| waste of cycles and code, especially on early machines.)
|
| Again, that shouldn't stop you, as the solution is similar to
| how you fill in the branch offsets of forward branches in
| if/then/else, and that's something you have to do anyway. (Yes,
| you can stream out your object code without going back and
| filling in branch offsets, but that leads to slow, ugly code.)
|
| I think the reason this _was_ done in old languages is that
| parsing and compiling were black arts when those languages
| were created. COBOL and FORTRAN didn't use stack frames,
| Algol invented per-function stack frames and later loosened
| that to per-block ones. That's what K&R C did, too.
|
| I don't know which language invented the idea that you could
| loosen that further, but it might have been C++.
| [deleted]
| yencabulator wrote:
| Now do that with the rest of the compilation in one pass,
| without an AST.
| Someone wrote:
| If you target a simple CPU (no out-of-order execution, no
| pipelines), and are ignoring performance (as early C compilers
| did: if you wanted performance, you inserted 'register' in the
| right places (and you knew what the right places were because
| you wrote the compiler or were sitting next to the guy who
| did), or dropped down to assembly), it's not that difficult.
|
| The tricky bits are the stack offsets to local variables. If
| you compile
|
|   int a = foo();
|   int b = a + 1;
|
| to
|
|   call foo
|   push
|   add 1
|   push
|
|
| the addresses of variables relative to the stack pointer
| change when you create new locals, when locals go out of
| scope, or when you push function arguments onto the
| stack. The trick is for the compiler to track the address
| of locals relative to the top of the stack frame and keep
| track of the depth of the stack frame. Then, when
| generating code to access a stack-relative location,
| combine the two to translate 'relative to top of stack
| frame' to 'relative to stack pointer'. A bit tedious, but
| doable.
|
| Now, when you want performance (especially on modern
| CPUs), and don't want to use stack space for variables
| kept in registers, things do get significantly more
| complicated.
| makapuf wrote:
| But here it's only the var syntax that is not allowed; the
| newvar := someexpression() form is allowed, having the same
| effect of declaring a new local variable, if I understand
| correctly? Edit: typo
| yencabulator wrote:
| I think it's just a case of keeping it simple. Basically, it
| looks like Mugo supports what Mugo consumes (give or take a
| couple of simple additions). Adding local `var` parsing would
| likely complicate the hand-crafted parser.
| benhoyt wrote:
| Yes, that's correct. I actually initially implemented
| local "var" parsing, but was never using it in the
| compiler, so I removed it to trim it down and keep things
| as simple as reasonably possible.
| userbinator wrote:
| _Fabrice's C compiler is implemented in 2048 bytes of obfuscated
| C, whereas mine is 1600 lines of formatted Go._
|
| "It can compile itself" is an interesting benchmark of language
| complexity, and we now have of C and Go, so I wonder how the
| other common languages would compare. Of course you can "cheat"
| by choosing a subset that reduces the complexity of the whole.
|
| _As you can see, Mugo uses its own very inefficient ABI that's
| nothing like the x86-64 ABI - the standard ABI puts the first six
| "cells" (64-bit values) in registers._
|
| Considering that stacking args is the norm for 32-bit x86 and a
| few other architectures, it's not that bad, especially since
| CPUs have special hardware to handle push/pop.
| pjmlp wrote:
| Most compiled languages are bootstrapped, so how come "we now
| have of C and Go"?
| userbinator wrote:
| Can you point to any other languages' examples of minimal
| self-compiling compilers?
| eat_veggies wrote:
| Here's a (probably non exhaustive) list I found
|
| https://en.wikipedia.org/wiki/Self-
| hosting_(compilers)#List_...
___________________________________________________________________
(page generated 2021-04-12 23:01 UTC)