[HN Gopher] Compilers: Incrementally and Extensibly (2024)
       ___________________________________________________________________
        
       Compilers: Incrementally and Extensibly (2024)
        
       Author : todsacerdoti
       Score  : 97 points
       Date   : 2025-04-05 12:55 UTC (10 hours ago)
        
 (HTM) web link (okmij.org)
 (TXT) w3m dump (okmij.org)
        
       | soegaard wrote:
       | The notes reference:                   "An Incremental Approach
       | to Compiler Construction"         Abdulaziz Ghuloum
       | http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
       | 
       | Can't recommend it enough.
        
         | sitkack wrote:
         | Andy Keep - Writing a Nanopass Compiler
         | https://www.youtube.com/watch?v=Os7FE3J-U5Q
         | 
         | https://github.com/akeep/scheme-to-c
        
         | mncharity wrote:
         | And as a github search[1], I recall it turning up fun work[2]
         | amidst clutter.
         | 
         | Hmm, I wonder if an LLM could sift a "Essentials of
         | Compilation" search[3] for _interesting_ repos?
         | 
         | [1] https://github.com/search?q=Ghuloum&type=repositories
         | https://github.com/search?q=incremental%20approach%20to%20co...
         | [2] just some old bookmarks: https://github.com/namin/inc
         | https://github.com/iambrj/imin
         | https://github.com/jaseemabid/inc [3]
         | https://github.com/search?q=Essentials%20of%20Compilation&ty...
        
       | djoldman wrote:
       | Learning and working on this stuff almost always felt "all or
       | nothing," (either it worked or it didn't), which was very
       | frustrating.
       | 
       | This looks interesting:
       | 
       | > Like Ghuloum [2006], this course is different. Our development
       | is by extending the complete, working compiler one small step at
       | a time. At each step we end up with the working compiler, for a
       | subset of the source language. Specifically, the methodology
       | ([Ghuloum, 2006, SS2.6]):
       | 
       | > 1. choose a small subset of the source language that is easy to
       | directly compile to assembly;
       | 
       | > 2. Write the extensive test cases;
        
       | Jtsummers wrote:
       | Another text using this approach:
       | 
       | https://mitpress.mit.edu/9780262047760/essentials-of-compila... -
       | Jeremy Siek's _Essentials of Compilation_ (there 's also a Python
       | version)
       | 
       | Admittedly I didn't finish working through it (life got in the
       | way), but I worked through a good chunk of it with a colleague
       | from a non-CS background. He wanted to understand compilers and
       | it seemed to be much more effective than other texts he'd tried
       | to work through. I'll have to look at this one at some point,
       | probably this summer when things are usually calmer for me.
        
         | sitkack wrote:
         | This is such a wonderful book.
         | 
         | This book is open access,
         | https://github.com/IUCompilerCourse/Essentials-of-Compilatio...
         | both the Python and the Racket versions.
         | 
         | If you are the kinda nerd that has nerd book club with friends,
         | this is book is _perfect_ for it.
        
       | nate_martin wrote:
       | >The compiler itself is to be developed in OCaml.
       | 
       | These seems like a misstep that I've seen in a few other compiler
       | implementation courses. For some reason these programming
       | language professors always insist on completing the project in
       | their personal favorite language (Haskell, OCaml, Standard ML,
       | etc).
       | 
       | As a student this makes the project significantly more difficult.
       | Instead of just learning how to implement a complier, you're also
       | learning a new (and likely difficult) programming language.
        
         | Jtsummers wrote:
         | Learning a new programming language isn't that hard of a task,
         | every decent programmer will learn a dozen or maybe even dozens
         | over their career.
         | 
         | Also, neither OCaml nor SML are hard to learn. Haskell is more
         | challenging, but that's because it's become, in a sense,
         | multiple languages. The core of Haskell is no harder than OCaml
         | or SML to learn, except for reasoning about lazy evaluation and
         | some of its consequences. All the things people use on top of
         | Haskell, though, does make it more to learn but what you'd need
         | to reach equivalent utility as SML or OCaml for a compilers
         | course is not that hard to learn.
        
           | remexre wrote:
           | Many universities that use OCaml in upper-division courses
           | also use it in lower-division; my university requires all CS
           | majors to take a course that is taught in OCaml, and covers
           | higher-order programming, "advanced" (Hindley-Milner) type
           | systems, equational reasoning, etc., typically in their
           | sophomore year.
           | 
           | The compilers class can then be taught in it without worrying
           | about that problem much.
        
         | mrkeen wrote:
         | I grew up with "easy" languages. Pascal, VB, C, C++, Java, C#.
         | And frankly I'm getting real sick of them.
         | 
         | I'm porting Dijkstra's algorithm over to C# at the moment, and
         | in the last several hours here's the two most clownish things
         | that have happened:
         | 
         | 1) I have:                 if (node is null) {         ..
         | }
         | 
         | My IDE is telling me "Expression is always false according to
         | nullable reference types' annotations". Nevertheless it enters
         | that section every time.
         | 
         | 2) I have:                 SortedSet<int> nums = [];
         | Console.Out.WriteLine(nums.Min);
         | 
         | You know what this prints?                 0
         | 
         | The minimal element of a set which has no elements is 0.
         | 
         | Yes, every language has its warts, and anecdotes like this
         | aren't going to change anyone's mind. I only wrote these up
         | because they're hitting me right now in one of those languages
         | that companies use "because anyone can learn them". I like
         | harder languages better than easy languages, because they're
         | easy.
        
           | neonsunset wrote:
           | This is possible if you disregarded compiler warnings and
           | assigned null to a non-nullable location.
        
           | tester756 wrote:
           | 2)
           | 
           | In general I'd recommend using Min() (from LINQ) which works
           | as expected
           | 
           | But this property has this remark: "If the SortedSet<T> has
           | no elements, then the Min property returns the default value
           | of T."
           | 
           | 1)
           | 
           | Feels like you used compiler hints incorrectly
        
         | mattgreenrocks wrote:
         | Algebraic data types are superb for compilers. Case
         | exhaustiveness checks are really helpful in this domain,
         | especially when doing any sort of semantic analysis.
         | 
         | Frankly, I try to avoid languages that don't have ADTs as much
         | as possible. They are incredibly useful for specifying
         | invariants via your design, and their constraints on inputs
         | lend themselves to easier implementation and maintenance.
        
         | linguae wrote:
         | In defense of such compilers professors, part of the purpose of
         | a good undergraduate computer science program is to expose
         | students to different programming language paradigms. Computing
         | professionals are expected to grasp new languages, APIs, and
         | tools quickly. Additionally, certain problems are easier to
         | express in certain paradigms. For example, I would use a
         | language like C or perhaps Rust in an operating systems class
         | since we need to operate at a level that is much closer to the
         | hardware. If I were teaching a course on machine learning, I'd
         | use Python for its excellent ecosystem of libraries, though R
         | and Julia are good alternatives. A course on relational
         | databases should teach basic SQL.
         | 
         | Back in the 2000s there were some CS undergraduate programs
         | that attempted to use Java in the entire curriculum, from
         | introductory courses all the way to senior-level courses such
         | as compilers. There was even an operating systems textbook that
         | had Java examples throughout the text
         | (https://www.amazon.com/Operating-System-Concepts-Abraham-
         | Sil...).
         | 
         | I think using only one language for the entire undergraduate CS
         | curriculum is a mistake. Sure, students don't have to spend
         | time learning additional languages. However, everything has to
         | fit into that one language, depriving students the opportunity
         | to see how languages that are better suited to specific types
         | of problems could actually enhance their understanding of the
         | concepts they are learning. In the case of Java, it's a good
         | general-purpose programming language, but there are classes
         | such as computer organization and operating systems where it's
         | important to discuss low-level memory management, which
         | conflicts with Java's automatic memory management.
         | 
         | When it comes to writing compilers, it turns out that
         | functional programming languages with algebraic data types and
         | pattern matching make working with abstract syntax trees much
         | easier. I learned this the hard way when I took compilers in
         | 2009 at Cal Poly. At the beginning of the course, we were given
         | two weeks to write an AST interpreter of a subset of Scheme. My
         | lab partner and I didn't like Dr. Scheme (now known as Racket),
         | which we "endured" the previous quarter in a class on
         | implementing programming language interpreters, and so we set
         | about writing our interpreter in C++. It turned out to be a big
         | mistake. We got it done, but it took us 40 hours to implement,
         | and we had a complex class hierarchy to implement the AST. We
         | realized that Dr. Scheme's features were well-suited for
         | manipulating and traversing ADTs. We never complained about Dr.
         | Scheme or functional programming again, and we gladly did our
         | other compiler assignments in that language.
         | 
         | 16 years later, I teach Haskell to my sophomore-level students
         | in my discrete mathematics class at a Bay Area community
         | college that uses C++ for the introductory programming course.
        
       | norir wrote:
       | The hardest part of writing a compiler is designing a language.
       | In my opinion, most compiler courses seem too fixated on the
       | implementation details. The fact that this course targets x86
       | already misses the mark for me. Computers are very fast, there is
       | no reason to start with a language that translates to asm. This
       | is incidental complexity that distracts from language design.
        
         | wseqyrku wrote:
         | > The hardest part of writing a compiler is designing a
         | language.
         | 
         | That's a completely separate field. If you want to learn about
         | language design you should study that, though it helps if you
         | have some background on both to get started with either one.
        
         | dccsillag wrote:
         | Hard disagree. There are a lot of implementation "details"
         | that, if you want to do properly, are a lot of hard work and
         | very much nontrivial. For example, do try to write a compiler
         | with efficient incremental compilation, and especially one that
         | does so while also having optimization passes. And that's just
         | one example, most things in compiler implementations actually
         | turn out to be fairly complex. And lots of features that modern
         | languages support e.g. more powerful typesystems,
         | trait/typeclass systems, etc. are also very very tricky.
         | 
         | While designing a language is by no means trivial, it generally
         | really occupies just a very small fraction of the
         | language/compiler developer's time. And, in most cases, the two
         | things (language design + implementation details) have to walk
         | hand-in-hand, since small changes to the language design can
         | vastly improve the implementation end.
        
         | tiu wrote:
         | Regarding the language design part, many universities I know of
         | offer a {Advanced} Programming Languages class later on after a
         | Compilers 101 class and that is where many of the topics like
         | EBNF, Grammar Design along with different constructs are done.
         | 
         | Besides, for someone learning it for the first time I think
         | designing a new language seems a bit difficult.
        
         | f1shy wrote:
         | I disagree. First of all, you can write a compiler for any
         | existing language. But second, inhave experience writing
         | compilers and interpreters. For me the hardest was always to
         | give good information/feedback on errors. Because the compiler
         | cannot go ahead, means it is "confused" eliminates that
         | confusion to help the user is very much non trivial.
        
       | codr7 wrote:
       | I didn't make much progress on my own languages until I
       | discovered Forth, always seemed to get stuck and lose motivation
       | in the parser.
       | 
       | Writing the parser is the least interesting part of building a
       | language imo.
        
       | tiu wrote:
       | I just got over writing 2 (well 2.5) toy compilers and I think a
       | lot of the material in the compiler-teaching space lack some
       | subtle developmental aspects.
       | 
       | I wish there was a course designed somewhere which talked about
       | more ingrained issues: how to structure/design the AST[0], buffer
       | based vs noncontextual tokenization/parser design, index handling
       | and error sync in the parser, supporting multiple codegen
       | architectures, handling FFI, exposing the compiler as an API for
       | external tooling, namespaces and linkage handling etc. etc. etc.
       | 
       | It is refreshing to see how Carbon designed some of its
       | components (majorly the frontend, yet to take a look at the
       | backend) as it touches on some of the subtleties I mentioned. If
       | someone is starting out on writing one, I would recommend taking
       | a look at it or any of the talks.
       | 
       | Always nice to see new material coming up. A few resources that I
       | would like to mention would be dabaez's compiler course, Khoury
       | college's compiler course (in Rust, previously i think and
       | Ocaml), Nora Sandler's book as well as http://compilerbook.org;
       | Which I consider to be the best guide out there for writing small
       | learning compilers, the videos are good as well.
       | 
       | [0]: Some related content that I enjoyed reading:
       | https://lesleylai.info/en/ast-in-cpp-part-1-variant/
        
         | seanmcdirmid wrote:
         | You can learn all that stuff by doing and looking at other
         | people's design, it's a bit niche to package that up into a ln
         | advanced language design/compiler class, everyone is making
         | different trade offs so are focusing on different things. There
         | also isn't really a market for more than a few highly trained
         | language designers and implementers, and most wind up doing
         | different things than what they were trained for.
        
         | tester756 wrote:
         | >supporting multiple codegen architectures
         | 
         | Shouldnt it be a result of modular software design?
         | 
         | >exposing the compiler as an API for external tooling
         | 
         | Isn't it just generating libraries instead of executables?
        
           | remexre wrote:
           | Making the _compiler itself_ provide an API enables things
           | like LSP, which don't want to generate machine code at all. A
           | traditional single-pass compiler usually can't accommodate
           | this without re-plumbing.
        
             | halffullbrain wrote:
             | The Eclipse Compiler for Java [1] is a notable exception,
             | architected around incremental compilation, an API for
             | "live" AST manipulation, and a layered non-batch approach
             | to when to invoke various analysis steps.
             | 
             | The LSP for Java [2] used in eg. VSCode's Java plugins,
             | builds on this API.
             | 
             | But, no, I haven't seen a generalized approach to this
             | architecture discussed in literature.
             | 
             | 1: https://github.com/eclipse-jdt/eclipse.jdt.core 2:
             | https://github.com/eclipse-jdtls/eclipse.jdt.ls
        
           | tiu wrote:
           | I wrote 'multiple codegen architectures' instead of 'multiple
           | architectures for codegen'.
           | 
           | As far as I have done in the toy compilers and seen the
           | things in actual production ready compilers, the codegen is
           | still very much tied to the one thing or the other rest llvm.
        
       | convivialdingo wrote:
       | I've been modifying the the MIR c2mir JIT compiler to extend the
       | c11 compiler to support simple classes, boxed strings(immutable,
       | nun-nullable) with AOT support.
       | 
       | Imagine if Java and C had a love child, basically.
       | 
       | MIR is a fantastic piece of engineering.
       | 
       | Honestly the hardest part is representing types. Having played
       | around with other compilers it seems to be a typical problem.
       | 
       | I'm stuck in the minutiae of representing highly structured
       | complexity and defining behavior in c. I can understand why many
       | languages have an intermediate compiler - it's too much work and
       | it will probably change over time.
        
       ___________________________________________________________________
       (page generated 2025-04-05 23:00 UTC)