[HN Gopher] Writing a C compiler in 500 lines of Python
___________________________________________________________________
Writing a C compiler in 500 lines of Python
Author : vgel
Score : 170 points
Date : 2023-09-04 19:17 UTC (3 hours ago)
(HTM) web link (vgel.me)
(TXT) w3m dump (vgel.me)
| brundolf wrote:
| > Instead, we'll be single-pass: code generation happens during
| parsing
|
| IIRC, C was specifically designed to allow single-pass
| compilation, right? I.e. in many languages you don't know what
| needs to be output without parsing the full AST, but in C, syntax
| directly implies semantics. I think I remember hearing this was
| because early computers couldn't necessarily fit the AST for an
| entire code file in memory at once
| WalterBright wrote:
| You're exactly right. This makes for a small, memory-efficient
| compiler. But this entails a lot of compromises that we're not
| willing to put up with anymore, because there's no longer a
| reason to.
| frutiger wrote:
| I once read that this is why the MSVC compiler didn't support
| two-pass template instantiation until very recently: the
| original compiler implemented templates almost like a macro
| that re-emitted a stream of tokens with the template parameters
| replaced.
| speps wrote:
| Linked from another thread: http://cm.bell-
| labs.co/who/dmr/chist.html
|
| It explains the memory limits and what happened :)
|
| > After the TMG version of B was working, Thompson rewrote B in
| itself (a bootstrapping step). During development, he
| continually struggled against memory limitations: each language
| addition inflated the compiler so it could barely fit, but each
| rewrite taking advantage of the feature reduced its size. For
| example, B introduced generalized assignment operators, using
| x=+y to add y to x. The notation came from Algol 68
| [Wijngaarden 75] via McIlroy, who had incorporated it into his
| version of TMG. (In B and early C, the operator was spelled =+
| instead of += ; this mistake, repaired in 1976, was induced by
| a seductively easy way of handling the first form in B's
| lexical analyzer.)
| bee_rider wrote:
| I wonder why =+ is so obviously a mistake. It does look
| vaguely wrong for some reason, but I'm prejudiced by current
| languages.
| vgel wrote:
| I think because it's ambiguous with unary plus (a = +b),
| since C isn't supposed to have significant whitespace in
| most circumstances.
| tom_ wrote:
| You also run into problems with a=*p and a=-b, which are
| perhaps more likely.
| zeusk wrote:
| But they could've fixed that by going a=(*p) and a=(-b);
|
| Kind of how we use (*p)->next instead of *p->next where p
| is node_t**
| vgel wrote:
| I'm not sure, haven't looked at the codebases of old compilers
| in a long time. Definitely a lot of the language is pretty
| amenable to it, especially if you have unstructured jumps for
| e.g. the for advancement statement. I had a distinct feeling
| while writing the compiler every time I added a new feature
| that "wow, the semantics work exactly how I'd like them to for
| ease of implementation."
|
| Compare that to, say, Rust, which would be pretty painful to
| single-pass compile with all the non-local behavior around
| traits.
| [deleted]
| zabzonk wrote:
| using recursive descent, you don't need to build an ast
| jjtheblunt wrote:
| the call stack during recursive descent is an ephemeral ast,
| for the recursive descent parsers I've written.
| WalterBright wrote:
| This looks a lot like the Tiny Pascal compiler that BYTE
| published a listing of back in 1978.
|
| http://www.trs-80.org/tiny-pascal/
|
| I figured out the basics of how a compiler works by going through
| it line by line.
| vgel wrote:
| Oh, that's neat (funny that they skipped out on similar things
| to me, like GOTO and structs :-)
|
| I didn't see a link to the source in the article, but this
| seems to be it: https://sourceforge.net/p/tiny-
| pascal/code/HEAD/tree/NorthSt...
| dugmartin wrote:
| I think Borland's Turbo Pascal was also a single pass compiler
| that emitted machine code as COM files.
| kwhitefoot wrote:
| Surely it is a feature of all Pascal compilers that they are
| single pass. I thought that it was part of the specification
| of the language that it be possible to compile in a single
| pass.
| [deleted]
| marcodiego wrote:
| It is interesting to think that 500 lines of code is something
| one can write in one or two days. But, writing a C compiler in
| 500 of comprehensible code (even in python) is challenge in
| itself that may take months after a few years of solid learning.
|
| I wonder if is this a good path to becoming an extremely
| productive developer. If some one spends time developing projects
| like this, but for different areas... A kernel, a compressor,
| renderer, multimedia/network stack, IA/ML... Will that turn a
| good dev into a 0.1 Bellard?
| rewmie wrote:
| > But, writing a C compiler in 500 of comprehensible code (even
| in python) is challenge in itself that may take months after a
| few years of solid learning.
|
| The people behind this project avoided that caveat by simply
| not implementing C. Apparently they kept a bit of the syntax
| but then proceeded to cherry-pick features that suited them and
| not make.an effort to even try to comply with any version of
| the standard.
| fanf2 wrote:
| It's a little bit bigger than Small C
| https://en.m.wikipedia.org/wiki/Small-C
| pitherpather wrote:
| > 0.1 Bellard
|
| Off topic, but a log scale might be useful: 0.1 Bellard --> -10
| deciBellards. That allows for: 0.001 Bellard --> -30
| deciBellards.
|
| Problem: Programmers with negative productivity cannot be
| represented on the same log scale.
| analog31 wrote:
| >>> Problem: Programmers with negative productivity cannot be
| represented on the same log scale.
|
| This is similar to the problem of price-to-earnings ratio.
| The ratio goes asymptotic as earnings goes through zero. It
| would be better to quote earnings-to-price ratio. Another
| screwy reciprocal unit is miles per gallon for cars.
| hedora wrote:
| Now that I have an EV with 130 miles of range, I'm finding
| miles per kwh to be much more useful than efficiency.
|
| I'd bet range anxiety was also a thing for early gasoline
| powered cars, so the early adopters of those probably
| preferred mpg over gpm.
| araes wrote:
| Sure they can. Abs(). Or if you prefer to not have quite so
| much micro-nano, you could also use Cumulative Distribution
| Functions [1] which are basically just sums of Probability
| Density [2].
|
| Are they a 4s programmer, 1s programmer, -0.5s programmer,
| -2s programmer?
|
| Plus, most people are "average" not negative productivity,
| and CDFs let you use really fun stuff like Beta Distributions
| (variable, shapable distributions) and Gamma Distributions
| (exponential distributions). They're super sweet as far as
| probability statistics.
|
| [1] https://en.wikipedia.org/wiki/Cumulative_distribution_fun
| cti...
|
| [2]
| https://en.wikipedia.org/wiki/Probability_density_function
|
| [3] https://en.wikipedia.org/wiki/Beta_distribution
|
| [4] https://en.wikipedia.org/wiki/Gamma_distribution
| hedora wrote:
| The number of people that are average is infinitesimal,
| even if you define average as "mode" instead of median or
| mean.
|
| About half are better than median though.
| [deleted]
| sciolist wrote:
| It does remind me of a project [1] Andrej Karpathy did, writing
| a neural network and training code in ~600 lines (although
| networks have easier logic to code than a compiler).
|
| [1] https://github.com/karpathy/nanoGPT
| pama wrote:
| This is an implementation of GPT using the pytorch library.
| It is not meant to be the shortest implementation of a
| trainable GPT, however it is very clean code. Pytorch does a
| lot of the heavy lifting, especially when it comes to
| training on multiple GPU. This implementation only works with
| data distributed parallel training, so one could not train
| models of the size of GPT-4 with it out of the box.
| tavianator wrote:
| Perhaps they were thinking of
| https://github.com/karpathy/micrograd
| Barrin92 wrote:
| at the very least it'll remove a lot of 'magic' from
| programming. Today a lot of people seem to be not so fond of
| university education but I'm personally very glad it made me go
| through implementing a shell, a compiler, a little toy kernel
| and so on.
|
| The feeling that you write code somewhere in the skies and have
| no idea how something works underneath has always really bugged
| me when I've used something.
| chaxor wrote:
| You don't need a university education to do those things,
| just some curiosity.
|
| The function of the university in the near future will
| probably just be to have like-minded curious people to
| discuss ideas with, and to get a better grasp of what
| problems need to be solved (specifically scientific ideas,
| rather than just applying engineering).
|
| The prestige element (specifically of certain universities
| over others, perhaps not university over high school) is
| dwindling, and hopefully will be abolished with this new
| generation.
| jabits wrote:
| A university degree is much more than this, and I think
| most people who view its value as "dwindling" have not had
| the experience...
| solumunus wrote:
| It massively depends on what degree and where I suppose.
| Most people I know with degrees view it as a complete
| waste of money.
| convolvatron wrote:
| I'm largely taught outside the academic world. so I
| sympathize with your position.
|
| however, the engineering culture which took the time to
| tell me about all these cool things and let me grow into
| being an expert in them seems to be largely gone.
| mati365 wrote:
| I made similar project in TypeScript[1]. Basically multipass
| compiler that generates x86 assembly, compiles it to binary and
| runs it. The worst thing were register allocator, designing IR
| code and assembler.
|
| [1] https://github.com/Mati365/ts-c-compiler
| amedvednikov wrote:
| Nice project!
| vgel wrote:
| Ooh, this is cool! Using WASM let me avoid writing a register
| allocator (though I probably would have just used the stack if
| I had targeted x86/ARM since I wasn't going for speed).
| kaycebasques wrote:
| Is there a C compiler written in Python that aims for maximum
| readability rather than trying to get as much done under X lines
| of code?
| muth02446 wrote:
| Not quite a C compiler but arguably better:
|
| http://cwerg.org
| vgel wrote:
| I think the code is fairly readable! It's formatted with Black
| (and therefore limited to reasonable line lengths) and well-
| commented.
|
| IMO, being under X lines of code is _part of_ the readability--
| 10,000 lines of code is hard to approach no matter how readable
| it otherwise is.
| rhabarba wrote:
| Finally, one can have inefficient C.
| wiseowise wrote:
| Why would language choice of compiler make any difference for
| efficiency of final output?
| vgel wrote:
| Maybe not the language choice, but the codegen of this
| compiler is _terrible_ because of the single-pass shortcuts
| (for example, it unconditionally loads the result of all
| assignment operations back to the stack _just in case_ you
| want to write `a = b = 1`, even though 99% of the time that
| load is immediately thrown away.)
| [deleted]
| NeuroCoder wrote:
| They didn't say the language was the issue. It doesn't
| support the full C spec. But if you want a reason why
| language might be an issue for a compiler, it could make
| compilation time slower. But I think the point of this
| project is not real world use but fun demonstration of skill
| folmar wrote:
| Always remember _bashcc_.
| [deleted]
| MaxBarraclough wrote:
| There's always the _CINT_ interpreter for C and C++.
|
| https://root.cern.ch/root/html534/guides/users-guide/CINT.ht...
| brnt wrote:
| A PTSD trigger for me. Only half joking. Funny thing is, I
| never checked out Cling to see if it was at long last the
| real deal.
| nn3 wrote:
| Just for comparison the LOCs for some other small C or C like
| compilers. It's not that far away from Ritchie's
|
| C4x86 | 0.6K (very close)
|
| small C (x86) | 3.1K
|
| Ritchie's earliest struct compiler | 2.3K
|
| v7 Unix C compiler | 10.2K
|
| chibicc | 8.4K
|
| Biederman's romcc | 25.0K
| userbinator wrote:
| This one is certainly stretching the definition of "C like",
| but it's just under 512 _bytes_ :
| https://news.ycombinator.com/item?id=36064971
| vgel wrote:
| Oh, C4 is neat--technically it has me beat since it also
| implements the VM to _run_ the code--though their formatting
| definitely takes advantage of long lines :-)
| [deleted]
| tptacek wrote:
| A time-honored approach!
|
| https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...
|
| (minus directly emitting opcodes, and fitting into 500 lines, of
| course.)
| fan_of_yoinked wrote:
| I love the graphic - would go see the worlds largest chomsky
| rcarmo wrote:
| I have to wonder if there's a Scheme to WASM compiler out there
| someplace right now I haven't found yet.
| vgel wrote:
| Looks like Schism (https://github.com/schism-lang/schism) got
| part of the way there, but it unfortunately seems to be dead.
| teddyh wrote:
| For some value of "C":
|
| > _Notably, it doesn 't support:_
|
| > _structs :-( would be possible with more code, the fundamentals
| were there, I just couldn 't squeeze it in_
|
| > _enums / unions_
|
| > _preprocessor directives (this would probably be 500 lines by
| itself...)_
|
| > _floating point. would also be possible, the wasm_type stuff is
| in, again just couldn 't squeeze it in_
|
| > _8 byte types (long /long long or double)_
|
| > _some other small things like pre /post cremements, in-place
| initialization, etc., which just didn't quite fit any sort of
| standard library or i/o that isn't returning an integer from
| main()_
|
| > _casting expressions_
| [deleted]
| vgel wrote:
| Well, I set the 500 line budget up front, and that was really
| as much as I could fit with reasonable formatting. I'll be
| excited to see your 500 line C compiler supporting all those
| features once it's done ;-)
| spease wrote:
| C--23
|
| (Respect to the author for doing this, I just couldn't resist
| the obvious joke)
| vgel wrote:
| I actually almost made it a C--
| (https://www.cs.tufts.edu/~nr/c--/download/ppdp.pdf)
| compiler, but IIRC the `goto`s made me go with the regular C
| subset instead.
___________________________________________________________________
(page generated 2023-09-04 23:00 UTC)