[HN Gopher] Compilers: Incrementally and Extensibly (2024)
       ___________________________________________________________________
        
       Compilers: Incrementally and Extensibly (2024)
        
       Author : todsacerdoti
       Score  : 138 points
       Date   : 2025-04-05 12:55 UTC (1 days ago)
        
 (HTM) web link (okmij.org)
 (TXT) w3m dump (okmij.org)
        
       | soegaard wrote:
       | The notes reference:                   "An Incremental Approach
       | to Compiler Construction"         Abdulaziz Ghuloum
       | http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
       | 
       | Can't recommend it enough.
        
         | sitkack wrote:
         | Andy Keep - Writing a Nanopass Compiler
         | https://www.youtube.com/watch?v=Os7FE3J-U5Q
         | 
         | https://github.com/akeep/scheme-to-c
        
         | mncharity wrote:
         | And as a github search[1], I recall it turning up fun work[2]
         | amidst clutter.
         | 
         | Hmm, I wonder if an LLM could sift a "Essentials of
         | Compilation" search[3] for _interesting_ repos?
         | 
         | [1] https://github.com/search?q=Ghuloum&type=repositories
         | https://github.com/search?q=incremental%20approach%20to%20co...
         | [2] just some old bookmarks: https://github.com/namin/inc
         | https://github.com/iambrj/imin
         | https://github.com/jaseemabid/inc [3]
         | https://github.com/search?q=Essentials%20of%20Compilation&ty...
        
       | djoldman wrote:
       | Learning and working on this stuff almost always felt "all or
       | nothing," (either it worked or it didn't), which was very
       | frustrating.
       | 
       | This looks interesting:
       | 
       | > Like Ghuloum [2006], this course is different. Our development
       | is by extending the complete, working compiler one small step at
       | a time. At each step we end up with the working compiler, for a
       | subset of the source language. Specifically, the methodology
       | ([Ghuloum, 2006, SS2.6]):
       | 
       | > 1. choose a small subset of the source language that is easy to
       | directly compile to assembly;
       | 
       | > 2. Write the extensive test cases;
        
       | Jtsummers wrote:
       | Another text using this approach:
       | 
       | https://mitpress.mit.edu/9780262047760/essentials-of-compila... -
       | Jeremy Siek's _Essentials of Compilation_ (there 's also a Python
       | version)
       | 
       | Admittedly I didn't finish working through it (life got in the
       | way), but I worked through a good chunk of it with a colleague
       | from a non-CS background. He wanted to understand compilers and
       | it seemed to be much more effective than other texts he'd tried
       | to work through. I'll have to look at this one at some point,
       | probably this summer when things are usually calmer for me.
        
         | sitkack wrote:
         | This is such a wonderful book.
         | 
         | This book is open access,
         | https://github.com/IUCompilerCourse/Essentials-of-Compilatio...
         | both the Python and the Racket versions.
         | 
         | If you are the kinda nerd that has nerd book club with friends,
         | this is book is _perfect_ for it.
        
       | nate_martin wrote:
       | >The compiler itself is to be developed in OCaml.
       | 
       | These seems like a misstep that I've seen in a few other compiler
       | implementation courses. For some reason these programming
       | language professors always insist on completing the project in
       | their personal favorite language (Haskell, OCaml, Standard ML,
       | etc).
       | 
       | As a student this makes the project significantly more difficult.
       | Instead of just learning how to implement a complier, you're also
       | learning a new (and likely difficult) programming language.
        
         | Jtsummers wrote:
         | Learning a new programming language isn't that hard of a task,
         | every decent programmer will learn a dozen or maybe even dozens
         | over their career.
         | 
         | Also, neither OCaml nor SML are hard to learn. Haskell is more
         | challenging, but that's because it's become, in a sense,
         | multiple languages. The core of Haskell is no harder than OCaml
         | or SML to learn, except for reasoning about lazy evaluation and
         | some of its consequences. All the things people use on top of
         | Haskell, though, does make it more to learn but what you'd need
         | to reach equivalent utility as SML or OCaml for a compilers
         | course is not that hard to learn.
        
           | remexre wrote:
           | Many universities that use OCaml in upper-division courses
           | also use it in lower-division; my university requires all CS
           | majors to take a course that is taught in OCaml, and covers
           | higher-order programming, "advanced" (Hindley-Milner) type
           | systems, equational reasoning, etc., typically in their
           | sophomore year.
           | 
           | The compilers class can then be taught in it without worrying
           | about that problem much.
        
         | mrkeen wrote:
         | I grew up with "easy" languages. Pascal, VB, C, C++, Java, C#.
         | And frankly I'm getting real sick of them.
         | 
         | I'm porting Dijkstra's algorithm over to C# at the moment, and
         | in the last several hours here's the two most clownish things
         | that have happened:
         | 
         | 1) I have:                 if (node is null) {         ..
         | }
         | 
         | My IDE is telling me "Expression is always false according to
         | nullable reference types' annotations". Nevertheless it enters
         | that section every time.
         | 
         | 2) I have:                 SortedSet<int> nums = [];
         | Console.Out.WriteLine(nums.Min);
         | 
         | You know what this prints?                 0
         | 
         | The minimal element of a set which has no elements is 0.
         | 
         | Yes, every language has its warts, and anecdotes like this
         | aren't going to change anyone's mind. I only wrote these up
         | because they're hitting me right now in one of those languages
         | that companies use "because anyone can learn them". I like
         | harder languages better than easy languages, because they're
         | easy.
        
           | neonsunset wrote:
           | This is possible if you disregarded compiler warnings and
           | assigned null to a non-nullable location.
        
             | nextaccountic wrote:
             | Feels like this should be a compiler error. Or saying
             | otherwise, this location isn't very non-nullable if you can
             | assign null to it
        
               | neonsunset wrote:
               | Just add <WarningsAsErrors>nullable</WarningsAsErrors> to
               | the project manifest.
        
           | tester756 wrote:
           | 2)
           | 
           | In general I'd recommend using Min() (from LINQ) which works
           | as expected
           | 
           | But this property has this remark: "If the SortedSet<T> has
           | no elements, then the Min property returns the default value
           | of T."
           | 
           | 1)
           | 
           | Feels like you used compiler hints incorrectly
        
         | mattgreenrocks wrote:
         | Algebraic data types are superb for compilers. Case
         | exhaustiveness checks are really helpful in this domain,
         | especially when doing any sort of semantic analysis.
         | 
         | Frankly, I try to avoid languages that don't have ADTs as much
         | as possible. They are incredibly useful for specifying
         | invariants via your design, and their constraints on inputs
         | lend themselves to easier implementation and maintenance.
        
         | linguae wrote:
         | In defense of such compilers professors, part of the purpose of
         | a good undergraduate computer science program is to expose
         | students to different programming language paradigms. Computing
         | professionals are expected to grasp new languages, APIs, and
         | tools quickly. Additionally, certain problems are easier to
         | express in certain paradigms. For example, I would use a
         | language like C or perhaps Rust in an operating systems class
         | since we need to operate at a level that is much closer to the
         | hardware. If I were teaching a course on machine learning, I'd
         | use Python for its excellent ecosystem of libraries, though R
         | and Julia are good alternatives. A course on relational
         | databases should teach basic SQL.
         | 
         | Back in the 2000s there were some CS undergraduate programs
         | that attempted to use Java in the entire curriculum, from
         | introductory courses all the way to senior-level courses such
         | as compilers. There was even an operating systems textbook that
         | had Java examples throughout the text
         | (https://www.amazon.com/Operating-System-Concepts-Abraham-
         | Sil...).
         | 
         | I think using only one language for the entire undergraduate CS
         | curriculum is a mistake. Sure, students don't have to spend
         | time learning additional languages. However, everything has to
         | fit into that one language, depriving students the opportunity
         | to see how languages that are better suited to specific types
         | of problems could actually enhance their understanding of the
         | concepts they are learning. In the case of Java, it's a good
         | general-purpose programming language, but there are classes
         | such as computer organization and operating systems where it's
         | important to discuss low-level memory management, which
         | conflicts with Java's automatic memory management.
         | 
         | When it comes to writing compilers, it turns out that
         | functional programming languages with algebraic data types and
         | pattern matching make working with abstract syntax trees much
         | easier. I learned this the hard way when I took compilers in
         | 2009 at Cal Poly. At the beginning of the course, we were given
         | two weeks to write an AST interpreter of a subset of Scheme. My
         | lab partner and I didn't like Dr. Scheme (now known as Racket),
         | which we "endured" the previous quarter in a class on
         | implementing programming language interpreters, and so we set
         | about writing our interpreter in C++. It turned out to be a big
         | mistake. We got it done, but it took us 40 hours to implement,
         | and we had a complex class hierarchy to implement the AST. We
         | realized that Dr. Scheme's features were well-suited for
         | manipulating and traversing ADTs. We never complained about Dr.
         | Scheme or functional programming again, and we gladly did our
         | other compiler assignments in that language.
         | 
         | 16 years later, I teach Haskell to my sophomore-level students
         | in my discrete mathematics class at a Bay Area community
         | college that uses C++ for the introductory programming course.
        
           | DKordic wrote:
           | Interesting arguments.
           | 
           | Lisp Machines? "Separation of Concerns" [1]? Conway's Law?
           | 
           | [1]: http://alarmingdevelopment.org/?p=805
        
         | sn9 wrote:
         | You have to pick some language and some amount of students are
         | likely to be new to it, so you might as well pick one that
         | excels at the task of writing compilers as ML-family languages
         | are.
        
         | DKordic wrote:
         | I have a challenge for You: make applications of Your choice of
         | BlackPill [1] or Radxa ZERO 3E [2].
         | 
         | Hint: You might be interested in Forth Systems and Lisp
         | Machines.
         | 
         | [1]: https://github.com/WeActStudio/WeActStudio.MiniSTM32F4x1
         | [2]: https://radxa.com/products/zeros/zero3e
        
       | norir wrote:
       | The hardest part of writing a compiler is designing a language.
       | In my opinion, most compiler courses seem too fixated on the
       | implementation details. The fact that this course targets x86
       | already misses the mark for me. Computers are very fast, there is
       | no reason to start with a language that translates to asm. This
       | is incidental complexity that distracts from language design.
        
         | wseqyrku wrote:
         | > The hardest part of writing a compiler is designing a
         | language.
         | 
         | That's a completely separate field. If you want to learn about
         | language design you should study that, though it helps if you
         | have some background on both to get started with either one.
        
         | dccsillag wrote:
         | Hard disagree. There are a lot of implementation "details"
         | that, if you want to do properly, are a lot of hard work and
         | very much nontrivial. For example, do try to write a compiler
         | with efficient incremental compilation, and especially one that
         | does so while also having optimization passes. And that's just
         | one example, most things in compiler implementations actually
         | turn out to be fairly complex. And lots of features that modern
         | languages support e.g. more powerful typesystems,
         | trait/typeclass systems, etc. are also very very tricky.
         | 
         | While designing a language is by no means trivial, it generally
         | really occupies just a very small fraction of the
         | language/compiler developer's time. And, in most cases, the two
         | things (language design + implementation details) have to walk
         | hand-in-hand, since small changes to the language design can
         | vastly improve the implementation end.
        
         | tiu wrote:
         | Regarding the language design part, many universities I know of
         | offer a {Advanced} Programming Languages class later on after a
         | Compilers 101 class and that is where many of the topics like
         | EBNF, Grammar Design along with different constructs are done.
         | 
         | Besides, for someone learning it for the first time I think
         | designing a new language seems a bit difficult.
        
         | f1shy wrote:
         | I disagree. First of all, you can write a compiler for any
         | existing language. But second, inhave experience writing
         | compilers and interpreters. For me the hardest was always to
         | give good information/feedback on errors. Because the compiler
         | cannot go ahead, means it is "confused" eliminates that
         | confusion to help the user is very much non trivial.
        
       | codr7 wrote:
       | I didn't make much progress on my own languages until I
       | discovered Forth, always seemed to get stuck and lose motivation
       | in the parser.
       | 
       | Writing the parser is the least interesting part of building a
       | language imo.
        
         | nickpsecurity wrote:
         | That varies by person. While it may or may not interest, I'll
         | say learning how to write or generate parsers is a useful skill
         | that might keep paying off.
         | 
         | I keep running into situations where I'd like to describe data
         | in a high level. BNF grammars often fit those situations, are
         | more readable than regex's, and could make for nice parsers.
         | One must know how to parse, though. :)
        
         | DKordic wrote:
         | Exactly! "Life After Parsing" [1].
         | 
         | IMHO C Family is a symptom of "low-level" of competence. I
         | often get surprised by superficiality of their arguments.
         | "Sophistry" is a useful concept (in a single word). Ad nauseam?
         | 
         | [1]:
         | http://www.semanticdesigns.com/Products/DMS/LifeAfterParsing...
        
       | tiu wrote:
       | I just got over writing 2 (well 2.5) toy compilers and I think a
       | lot of the material in the compiler-teaching space lack some
       | subtle developmental aspects.
       | 
       | I wish there was a course designed somewhere which talked about
       | more ingrained issues: how to structure/design the AST[0], buffer
       | based vs noncontextual tokenization/parser design, index handling
       | and error sync in the parser, supporting multiple codegen
       | architectures, handling FFI, exposing the compiler as an API for
       | external tooling, namespaces and linkage handling etc. etc. etc.
       | 
       | It is refreshing to see how Carbon designed some of its
       | components (majorly the frontend, yet to take a look at the
       | backend) as it touches on some of the subtleties I mentioned. If
       | someone is starting out on writing one, I would recommend taking
       | a look at it or any of the talks.
       | 
       | Always nice to see new material coming up. A few resources that I
       | would like to mention would be dabaez's compiler course, Khoury
       | college's compiler course (in Rust, previously i think and
       | Ocaml), Nora Sandler's book as well as http://compilerbook.org;
       | Which I consider to be the best guide out there for writing small
       | learning compilers, the videos are good as well.
       | 
       | [0]: Some related content that I enjoyed reading:
       | https://lesleylai.info/en/ast-in-cpp-part-1-variant/
        
         | seanmcdirmid wrote:
         | You can learn all that stuff by doing and looking at other
         | people's design, it's a bit niche to package that up into a ln
         | advanced language design/compiler class, everyone is making
         | different trade offs so are focusing on different things. There
         | also isn't really a market for more than a few highly trained
         | language designers and implementers, and most wind up doing
         | different things than what they were trained for.
        
         | tester756 wrote:
         | >supporting multiple codegen architectures
         | 
         | Shouldnt it be a result of modular software design?
         | 
         | >exposing the compiler as an API for external tooling
         | 
         | Isn't it just generating libraries instead of executables?
        
           | remexre wrote:
           | Making the _compiler itself_ provide an API enables things
           | like LSP, which don't want to generate machine code at all. A
           | traditional single-pass compiler usually can't accommodate
           | this without re-plumbing.
        
             | halffullbrain wrote:
             | The Eclipse Compiler for Java [1] is a notable exception,
             | architected around incremental compilation, an API for
             | "live" AST manipulation, and a layered non-batch approach
             | to when to invoke various analysis steps.
             | 
             | The LSP for Java [2] used in eg. VSCode's Java plugins,
             | builds on this API.
             | 
             | But, no, I haven't seen a generalized approach to this
             | architecture discussed in literature.
             | 
             | 1: https://github.com/eclipse-jdt/eclipse.jdt.core 2:
             | https://github.com/eclipse-jdtls/eclipse.jdt.ls
        
             | tester756 wrote:
             | I'm aware, I've used C#/.NET Compiler ecosystem and their
             | Compiler as a Service approach
             | 
             | But it seems like you're just creating libs that everyone
             | and embedd / use and that's it
             | 
             | e.g `auto ast = Compiler.GenerateAST(code);`
        
           | tiu wrote:
           | I wrote 'multiple codegen architectures' instead of 'multiple
           | architectures for codegen'.
           | 
           | As far as I have done in the toy compilers and seen the
           | things in actual production ready compilers, the codegen is
           | still very much tied to the one thing or the other rest llvm.
        
             | elcritch wrote:
             | Nim has multiple backends and is relatively mature. It's
             | fairly readable as compilers go.
             | 
             | There's also a new experimental rewrite of the Nim compiler
             | called Nimony which targets a new intermediate called NIFC.
             | That is intended to the be transformed to C, LLVM,
             | JavaScript, etc.
        
             | tester756 wrote:
             | >I wrote 'multiple codegen architectures' instead of
             | 'multiple architectures for codegen'.
             | 
             | You want to generate e.g x86 + ARM + RISCV, yea?, and
             | shouldnt it be a result of modular architecture?
             | 
             | like your various codegens just take your AST and generate
             | output
        
         | yencabulator wrote:
         | The compilerbook link is broken in multiple ways, but
         | www.compilerbook.com redirects to
         | https://www3.nd.edu/~dthain/compilerbook/
        
           | tiu wrote:
           | Yes, thank you. That is the Douglas Thain book I meant.
           | (compilerbook.org seems to work for me, only broken thing
           | would be the semicolon).
        
       | convivialdingo wrote:
       | I've been modifying the the MIR c2mir JIT compiler to extend the
       | c11 compiler to support simple classes, boxed strings(immutable,
       | nun-nullable) with AOT support.
       | 
       | Imagine if Java and C had a love child, basically.
       | 
       | MIR is a fantastic piece of engineering.
       | 
       | Honestly the hardest part is representing types. Having played
       | around with other compilers it seems to be a typical problem.
       | 
       | I'm stuck in the minutiae of representing highly structured
       | complexity and defining behavior in c. I can understand why many
       | languages have an intermediate compiler - it's too much work and
       | it will probably change over time.
        
         | fuhsnn wrote:
         | >I've been modifying the the MIR c2mir JIT compiler to extend
         | the c11 compiler to support simple classes, boxed
         | strings(immutable, nun-nullable) with AOT support
         | 
         | Is the project public? Really interested in the AOT support,
         | I've always wanted to see its generated code but didn't find an
         | easy way to dump it.
        
           | convivialdingo wrote:
           | Once things are a little more stable, I will put it up!
           | 
           | Right now you can just break before the (fun_call)() delegate
           | and disassemble the fun_call in gdb.
           | 
           | The basic trick is to add reloc support to the x86 translate
           | code, mark external calls and replace with 0x0 placeholders,
           | and copy out the machine_code and data segment output to an
           | object file.
           | 
           | I can do basic main functions with simple prints calls but
           | not much more. It's a hack for now but I'll refactor it until
           | it's solid.
        
       ___________________________________________________________________
       (page generated 2025-04-06 23:02 UTC)