[HN Gopher] Resources for Amateur Compiler Writers
___________________________________________________________________
Resources for Amateur Compiler Writers
Author : kilodeca
Score : 285 points
Date : 2021-04-24 14:58 UTC (8 hours ago)
(HTM) web link (c9x.me)
(TXT) w3m dump (c9x.me)
| chandlore wrote:
| Latest versions of the ABI specifications linked in the _Machine
| Specific_ section
|
| ARM: https://github.com/ARM-software/abi-aa/releases
|
| x86-64: https://gitlab.com/x86-psABIs/x86-64-ABI (go to most
| recent CI job and download artifacts for a compiled PDF)
| hardwaregeek wrote:
| This is definitely a backend, machine code focused list. If
| you're writing a compiler to WebAssembly or even the JVM you're
| probably not gonna need a lot of these resources.
|
| Indeed I'd argue that these lists are more for an intermediate
| compiler writer. Beginner compiler writers don't need to know
| register allocation. They need to understand how to think about
| language translation, how to write an intermediate
| representation, how to learn a target language.
|
| I've been toying with the idea of writing a post that goes
| through the thought process of translation. Specifically, how you
| build states, and invariants. When I write code generation I
| spend time thinking about the different states the stack could be
| in, or what invariant I need to keep across this expression (do I
| need to save a register or global?).
|
| One simple but important example of this is realizing that given
| a proper, sound type checker, your code generator should almost
| never give errors. By the time you reach code generation you
| should have proven enough invariants that code generation cannot
| fail. This isn't always true but I've found it to be the case.
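|
| A tiny sketch of the kind of invariant I mean (a hypothetical
| stack-machine code generator, names made up): the type checker
| has already guaranteed every expression is well formed, so the
| generator only asserts its own stack-depth bookkeeping and
| never reports user-facing errors.
|
|       def gen_expr(node, out):
|           # Invariant: each call leaves exactly one value on the stack.
|           if node[0] == "int":            # ("int", 42)
|               out.append(("PUSH", node[1]))
|               return 1
|           if node[0] == "add":            # ("add", lhs, rhs)
|               depth = gen_expr(node[1], out) + gen_expr(node[2], out)
|               out.append(("ADD",))        # pops 2, pushes 1
|               return depth - 1
|           # Unreachable if the type checker did its job.
|           raise AssertionError(f"bad node kind: {node[0]}")
|
|       code = []
|       gen_expr(("add", ("int", 1), ("int", 2)), code)
|       # code == [("PUSH", 1), ("PUSH", 2), ("ADD",)]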
| seanmcdirmid wrote:
| That reminds me of the dragon book that goes heavily into VAX
| assembly and optimizing for CISC machines. Probably not the
| best resource for beginners or amateurs these days, but its
| authors won a Turing Award at least.
|
| The problem is what it means to write a compiler is pretty
| broad, some amateur projects focus on backend work, some on
| front end, many in between.
| coryrc wrote:
| Also tremendous amounts just on parsing C. I found Lisp In
| Small Pieces far more interesting.
| enos_feedler wrote:
| You can certainly take any path into compiler writing. I don't
| know anything about ASTs or compiler front ends, but I ended up
| writing register allocators and constant folding for an NVIDIA
| GPU JIT.
| jstanley wrote:
| > Indeed I'd argue that these lists are more for an
| intermediate compiler writer. Beginner compiler writers don't
| need to know register allocation.
|
| I'm not disagreeing with you, but note that the title does not
| say it's for beginners, it says it's for amateurs, which is
| just anyone who does it for fun rather than money.
| orthoxerox wrote:
| > Beginner compiler writers don't need to know register
| allocation. They need to understand how to think about language
| translation, how to write an intermediate representation, how
| to learn a target language.
|
| I don't think people that want to write a compiler want to end
| up with helloworld.c instead of helloworld.exe. At least I
| don't.
| fooker wrote:
| Ideally, you'd emit LLVM IR or some similar bytecode or
| assembly. Then you'd use llc/opt or an appropriate assembler
| to get an executable.
|
| Trying to emit machine code isn't too interesting when you're
| writing your first compiler.
| CalChris wrote:
| Similarly, trying to parse C++ isn't too interesting when
| you're writing your first backend.
| hardwaregeek wrote:
| At least for me, compiling to a simpler target made learning
| the ideas easier. Compiling to raw assembly does give you
| more nerd credit but it also makes it harder to get started.
| ajkjk wrote:
| The opposite, really. I started learning about compilers
| because I wanted to implement my own syntactic ideas. Don't
| care at all how it runs.
| bootwoot wrote:
| The creator of C++ would disagree, as C++ was devised as a
| compile-to-C language.
| jhgb wrote:
| Perhaps Cooper's and Torczon's "Engineering a Compiler (2nd
| Edition)" should be among the entries in the "Books" category.
| estebank wrote:
| A lot of ink is spilled on the details of how compilers turn
| text into fast executables, but I believe the User Experience of
| compilers is just as important a subject. There's very little
| information about it in the literature, outside of a handful of
| papers and the design documents of existing compilers that do
| focus on it.
| dmitriid wrote:
| No idea why you are being downvoted, but it's true.
|
| Compiler UX is often horrible, but the new breed of languages
| tries to rectify parts of it:
|
| - non-cryptic error messages. Show _exactly_ what failed, where
| it failed, how to solve it.
|
| - compiler-as-a-service. Tool makers will love you.
|
| - compilation speed. If I need to wait 2 minutes for a 10 LOC
| project to compile, your compiler is doing something wrong.
|
| - the compiler should be the only tool you need. If you rely on
| 15 different tools just to compile a project, you're doing
| something wrong
| estebank wrote:
| As you allude to, treating a compiler as an isolated batch
| process is an outdated strategy. Any compiler started today
| should be designed such that it can be used in an IDE from
| the start, even if you don't implement it in that way until
| after you've reached a certain level of maturity.
|
| Error messages in particular are a very expansive topic,
| because a compiler might have the following components:
|
|       Lexer
|       Parser
|       Macro Expansion
|       Name Resolution
|       Lowering
|       Type Checking
|       Privacy Checking
|       Lifetime Analysis
|       Code Generation
|       Linking
|
| and _all of them_ can produce errors. If you introduce error
| recovery in your parser, then you had better make sure you're
| carrying that information forward to name resolution and type
| checking. If you introduce an error type placeholder, you'd
| better account for it during the rest of your analysis to avoid
| complaining about non-existent methods on something that wasn't
| resolved in the first place. Error deduplication. Exploring the
| grammar space to identify things that a human may attempt but
| that aren't allowed for good reason, and that won't blow up
| until some quasi-random stage. Now that I think about it, you
| could write a _whole book_ about this topic.
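|
| A minimal sketch of the error-placeholder idea (hypothetical
| type checker, made-up names, not any particular compiler): once
| something fails to resolve it gets an Error type that later
| stages treat as "already reported", so they don't pile on bogus
| "no such method" complaints.
|
|       ERROR = ("error",)  # poisoned type: already reported upstream
|
|       def check_method_call(obj_type, method, env, errors):
|           if obj_type == ERROR:
|               return ERROR  # don't complain about an unresolved thing
|           methods = env.get(obj_type, {})
|           if method not in methods:
|               errors.append(f"no method {method!r} on {obj_type}")
|               return ERROR
|           return methods[method]  # the method's return type
|
|       env = {"int": {"abs": "int"}}
|       errors = []
|       check_method_call(ERROR, "foo", env, errors)  # no new error
|       check_method_call("int", "foo", env, errors)  # one real error
|       print(errors)  # ["no method 'foo' on int"]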
| tomcam wrote:
| What is lowering?
| hctaw wrote:
| Some languages require higher-level internal representations
| ("IRs") to be "lowered" to lower-level representations before
| code generation - in other words, high-level semantics and
| syntactic sugar get transformed into simpler constructs that
| are easier to compile.
|
| For example, a language could support special iterator
| semantics that are "lowered" into a simple for loop. Or a for
| loop itself may be lowered into a do-while loop, depending on
| the design of the lowered IR.
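|
| A toy sketch of such a lowering pass (invented AST shapes,
| purely illustrative): a high-level for-each node is rewritten
| into an explicit counter-and-while loop before code generation
| ever sees it.
|
|       # ("foreach", var, seq, body)  ->  counter + while loop
|       def lower_foreach(node):
|           if node[0] != "foreach":
|               return node
|           _, var, seq, body = node
|           return ("block", [
|               ("assign", "_i", ("int", 0)),
|               ("while",
|                ("lt", ("var", "_i"), ("len", seq)),
|                ("block", [
|                    ("assign", var, ("index", seq, ("var", "_i"))),
|                    body,
|                    ("assign", "_i",
|                     ("add", ("var", "_i"), ("int", 1))),
|                ])),
|           ])
|
|       lowered = lower_foreach(
|           ("foreach", "x", ("var", "xs"), ("print", ("var", "x"))))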
| hctaw wrote:
| I downvoted it because I think it's a meme that ignores where
| contemporary compiler engineering is happening. We don't even
| realize they exist anymore - V8, Graal, Roslyn, etc are all
| optimizing JIT compilers running in the background that can
| compile multiple languages on the fly. And they do a great
| job.
|
| Not everything is a C++ compiler; C++ is inherently difficult
| to compile correctly, let alone quickly.
|
| I think the future of compilers is that everything will have
| an incremental parser and JIT compiler with optional AOT
| compilation for release builds. Hopefully LSP and DAP will
| take over the world of tooling integration.
|
| There are other fundamental mistakes to look at than error
| messages and performance, like pretending that build systems
| don't exist and that linking multiple compilation units isn't
| something the compiler can do. Compiling code is getting much
| easier, but building it is much harder. Compilers have the
| most information and internal structure to handle all our
| build problems, but don't do it.
| turminal wrote:
| > - non-cryptic error messages. Show exactly what failed,
| where it failed, how to solve it.
|
| I'm working on a compiler in my free time and I agree with
| you, but in defense of compilers that suck at this: error
| reporting is really, really hard to do right.
| Rochus wrote:
| There is a more recent edition of "Compilers: Principles,
| Techniques, and Tools" worth considering:
| https://www.amazon.com/Compilers-Principles-Techniques-Tools....
| ChicagoDave wrote:
| I've toyed with compiler stuff for years. My current idea is a
| fluent/piping language. Is there any reference that would help
| with that?
| [deleted]
| alanafarias22 wrote:
| Loved this post
| alanafarias22 wrote:
| These comments help a lot
| andromeduck wrote:
| Wish there were more resources about or tools for bidirectional
| parsing.
| hasitseth wrote:
| Alan Holub's 'Compiler Design in C' is not mentioned among the
| books. Holub's book had well-written code; I learned a lot
| from it.
| elvis70 wrote:
| And a free copy of this book in PDF has been made available by
| the author: https://holub.com/compiler/
| failwhaleshark wrote:
| ProTips after writing numerous kinds of compilers in university:
|
| Recursive descent with backtracking is much easier to construct
| and maintain for diagnostics than bison/flex.
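|
| For instance, a backtracking recursive-descent parser can be
| little more than functions that save and restore a position (a
| toy sketch over a made-up two-rule grammar, not production
| code):
|
|       class Parser:
|           def __init__(self, toks):
|               self.toks, self.pos = toks, 0
|
|           def eat(self, kind):
|               toks, i = self.toks, self.pos
|               if i < len(toks) and toks[i][0] == kind:
|                   self.pos += 1
|                   return toks[i]
|               return None
|
|           def expr(self):  # expr := NUM '+' expr | NUM
|               start = self.pos  # remember where we were
|               lhs = self.eat("NUM")
|               if lhs and self.eat("PLUS"):
|                   rhs = self.expr()
|                   if rhs:
|                       return ("add", lhs[1], rhs)
|               self.pos = start  # backtrack, try the other alternative
|               num = self.eat("NUM")
|               return ("num", num[1]) if num else None
|
|       p = Parser([("NUM", 1), ("PLUS", "+"), ("NUM", 2)])
|       print(p.expr())  # ('add', 1, ('num', 2))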
|
| Derivative parsing is very interesting.
| ahelwer wrote:
| Does anybody know of good introductory resources on error
| recovery techniques when writing a parser & interpreter? Bonus
| points if they use combinators instead of a hand-written parser.
| ruste wrote:
| I've actually been meaning to write something up about this.
| I've found you can get fairly good error messages from a parser
| combinator style parser just by saving the deepest location you
| successfully parsed to and the parser you failed in.
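|
| A rough sketch of what I mean (hypothetical combinators, just
| to illustrate): every parser records how far it got, and the
| driver reports whichever failure reached furthest.
|
|       # A parser: (text, pos) -> (value, new_pos), or None.
|       farthest = {"pos": 0, "expected": None}
|
|       def lit(s):
|           def p(text, pos):
|               if text.startswith(s, pos):
|                   return s, pos + len(s)
|               if pos >= farthest["pos"]:  # deepest failure so far
|                   farthest.update(pos=pos, expected=s)
|               return None
|           return p
|
|       def seq(*parsers):
|           def p(text, pos):
|               out = []
|               for sub in parsers:
|                   r = sub(text, pos)
|                   if r is None:
|                       return None
|                   v, pos = r
|                   out.append(v)
|               return out, pos
|           return p
|
|       assign = seq(lit("let "), lit("x"), lit(" = "), lit("1"))
|       if assign("let x := 1", 0) is None:
|           print(f"error at column {farthest['pos']}:"
|                 f" expected {farthest['expected']!r}")
|           # error at column 5: expected ' = '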
| jstimpfle wrote:
| Not a resource, but depending on what kind of parser you are
| doing, maybe an idea.
|
| The thing I'm going to try out next for one of my batch parsers
| (written recursive descent style as many or most parsers in the
| wild) is to have AST nodes of kind "invalid". The idea is that
| the parser should always complete, with minimal in-line error
| checking, even when I have say "require_token_kind()" or
| similar calls in my parser code.
|
| Maybe not even an "invalid" kind, but an additional "invalid"
| flag, so that even those invalid nodes can still have all the
| regular node-kind-depending structure on them.
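|
| Roughly what I have in mind (sketch only, invented node and
| token shapes): require_token_kind() never aborts, it just flags
| the node and keeps going, so the tree always comes out
| complete.
|
|       class Node:
|           def __init__(self, kind):
|               self.kind, self.children, self.invalid = kind, [], False
|
|       class Parser:
|           def __init__(self, toks):
|               self.toks = toks + [("EOF", "")]
|               self.pos, self.errors = 0, []
|
|           def require_token_kind(self, kind, node):
|               tok = self.toks[self.pos]
|               if tok[0] == kind:
|                   self.pos += 1
|                   return tok
|               # Record the problem and poison the node, don't bail out.
|               self.errors.append(f"expected {kind}, found {tok[0]}")
|               node.invalid = True
|               return tok  # don't consume; the caller keeps going
|
|           def parse_pair(self):  # a toy rule: NAME '=' NUM
|               node = Node("pair")
|               node.children.append(self.require_token_kind("NAME", node))
|               self.require_token_kind("EQ", node)
|               node.children.append(self.require_token_kind("NUM", node))
|               return node  # always a node, maybe flagged invalid
|
|       p = Parser([("NAME", "x"), ("NUM", "1")])  # missing '='
|       tree = p.parse_pair()
|       print(tree.invalid, p.errors)  # True ['expected EQ, found NUM']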
|
| The other obvious idea is to use a parser generator. But I'm
| pretty sure I wouldn't be happy on that route - from the little
| experiments that I've done, I infer it's too bureaucratic, too
| much boilerplate for me to suffer.
|
| The more general advice for error handling is that you should
| try to make APIs that require very little in-line error
| handling. Instead, keep errors behind the APIs, maybe collect
| them in a list, or only keep the first or last error, and allow
| to handle them at a later point (out-of-line).
|
| And boom, most of the talk about smart syntax and type systems
| and other clever systems to make error handling more convenient
| suddenly becomes obsolete.
|
| Yet more abstractly, don't try to build systems that make
| cumbersome things more convenient, but think how you can do
| with less of these cumbersome things. (That works wonders for
| runtime performance, too).
| tester756 wrote:
| >The more general advice for error handling is that you
| should try to make APIs that require very little in-line
| error handling. Instead, keep errors behind the APIs, maybe
| collect them in a list, or only keep the first or last error,
| and allow to handle them at a later point (out-of-line).
|
| I guess you meant that instead of just printing what's wrong at
| some point, you should be smarter and build handier error
| handling, like attaching the error to an AST node and then maybe
| doing something with it later.
| jstimpfle wrote:
| The important point is the temporal decoupling between the
| time where the error is detected and the time where it is
| handled.
| [deleted]
| UncleEntity wrote:
| _The Zephyr Abstract Syntax Description Language_ , Wang et al.
|
| Has been invaluable for creating the AST (and soon the IR
| nodes)-- though it has been a source of much yak shaving as I've
| rewritten my asdl generator at least 3 times.
|
| _Language Implementation Patterns: Create Your Own Domain-
| Specific and General Programming Languages Book_ , Terence Parr
|
| Pretty java/antlr focused but wasn't too hard to take the
| concepts and apply them to a C++ based transpiler I've been
| slowly poking at.
|
| _Combining Analyses, Combining Optimizations_ , Cliff Click
|
| His thesis and an extension of the linked paper _Global Code
| Motion, Global Value Numbering_ -- my next step down the rabbit
| hole...
|
| _Tree Automata Techniques and Applications_ , Comon et al.
|
| Still on the fence on this one vs the iburg/BURS algorithm,
| which would probably be more useful and match better with the
| way Parr's book goes about things.
| phonebucket wrote:
| One thing I find curious: lots of compiler writing seems to be
| done in Haskell, but I can hardly find any Haskell-based
| resources for writing compilers.
|
| Can anyone here please recommend any such resources?
| toolslive wrote:
| https://www.microsoft.com/en-us/research/publication/impleme...
| atsuzaki wrote:
| This is a pretty good tutorial that I know of:
| https://www.stephendiehl.com/llvm/. The author went on to write
| a book too, though it's still in progress
| http://dev.stephendiehl.com/fun/
| tester756 wrote:
| >lots of compiler writing seems to be done in Haskell
|
| Those for fun or real world?
| mechanicalDigit wrote:
| Most compilers are written for fun. That's the point of the
| thread.
|
| Also, a lot of modern language research is done from
| the "lens" (heh) of Haskell, or at least the Glasgow school.
| piinbinary wrote:
| Are there any good resources for building a runtime and codegen
| for languages with garbage collection?
| moonchild wrote:
| For a basic GC, code generation should be unaffected. The only
| place you'd need to care about code generation is if you wanted
| to make a realtime GC, inserting read or write barriers.
| piinbinary wrote:
| My understanding from doing a bit of reading is that there
| are at least two things codegen-adjacent that an accurate
| collector will want (even if you stop the world while
| collecting):
|
| 1. The locations of pointers inside of allocations.
|
| 2. The locations of pointers in the stack (into which you may
| need to spill registers).
|
| I can think of ways to do these (e.g. perhaps each stack
| frame contains a bitmask at a certain offset), but I'd love
| to learn what the best practices are.
| moonchild wrote:
| Usually your objects are boxed, and the box contains a type
| identifier which the gc can use to identify pointer
| offsets. It is apparently possible to do precise garbage
| collection without boxes--I'll see if I can find the paper
| and link it later--but this is not widely done.
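|
| A crude sketch of that layout (invented representation, just to
| show the shape): each box carries a type id, and the collector
| consults a per-type table of pointer offsets while tracing.
|
|       # type id -> which slots of the box hold pointers
|       POINTER_MAP = {
|           "cons": (0, 1),  # both car and cdr are pointers
|           "int":  (),      # no pointers inside
|       }
|
|       class Box:
|           def __init__(self, type_id, fields):
|               self.type_id, self.fields = type_id, fields
|               self.marked = False
|
|       def mark(root):
|           stack = [root]
|           while stack:
|               obj = stack.pop()
|               if obj is None or obj.marked:
|                   continue
|               obj.marked = True
|               for off in POINTER_MAP[obj.type_id]:  # real pointers only
|                   stack.append(obj.fields[off])
|
|       a = Box("int", [41])
|       lst = Box("cons", [a, None])
|       mark(lst)
|       print(a.marked)  # True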
| aardvark179 wrote:
| The Garbage Collection Handbook is excellent, but might go into
| more depth than you need. For a simple garbage collector there
| needn't be any effect on codegen, all you need to worry about
| is the runtime side of things. Search for how to build a
| conservative garbage collector and you should find plenty of
| tutorials and introductory CS courses on the subject.
| nethunters wrote:
| Any similar resources on writing interpreters?
|
| Also, is there a technical difference between an interpreter
| and a VM?
| mtlynch wrote:
| I haven't read it, but Bob Nystrom's _Crafting Interpreters_
| has been on my list for a while:
|
| https://craftinginterpreters.com/
| smueller1234 wrote:
| Don't wait, read it. It's fun and informative, worth every
| minute! I haven't had as much fun reading a technical book in
| years!
| orthoxerox wrote:
| It's a great book. Bob's not a CS teacher, and it's great:
| the book is grounded in practice and has been structured in a
| way that lets you get a tangible result at the end of every
| chapter.
| LordN00b wrote:
| I second this. It's a great resource, very well written.
| Opinionated enough to have character and humour, but not so
| dry as to make your eyes glaze over. I guess I'm saying that
| for a technical tome it's very, very readable (as was his
| Game Programming Patterns book, incidentally)
| theodpHN wrote:
| Check out Peter Norvig's neat BASIC Interpreter
| https://github.com/norvig/pytudes/blob/master/ipynb/BASIC.ip...
| MH15 wrote:
| Interpreters usually execute by walking the abstract syntax
| tree at runtime, bytecode VMs usually execute actual bytecode
| similar to machine code. The bytecode for such VMs is often way
| simpler than the instruction set for an actual computer, so
| it's far more portable between platforms.
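|
| A toy illustration of the difference (a made-up three-opcode
| bytecode, nothing standard): the VM's dispatch loop only knows
| a handful of simple instructions, and the same bytecode runs
| anywhere the loop does.
|
|       def run(bytecode):
|           stack, pc = [], 0
|           while pc < len(bytecode):
|               op, arg = bytecode[pc]
|               if op == "PUSH":
|                   stack.append(arg)
|               elif op == "ADD":
|                   b, a = stack.pop(), stack.pop()
|                   stack.append(a + b)
|               elif op == "PRINT":
|                   print(stack.pop())
|               pc += 1
|
|       # (1 + 2) compiled to portable bytecode, no host ISA involved
|       run([("PUSH", 1), ("PUSH", 2),
|            ("ADD", None), ("PRINT", None)])  # prints 3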
| tom_mellior wrote:
| > Interpreters usually execute by walking the abstract syntax
| tree at runtime, bytecode VMs usually execute actual bytecode
| similar to machine code.
|
| This isn't the right way to separate these concepts. A VM
| that executes bytecodes by interpretation is also an
| interpreter (Python, Ruby, PHP are well-known examples). Many
| VMs don't interpret, or they interpret mostly during startup,
| but actually try to compile hot parts of the code and execute
| that machine code rather than interpreting (Java and
| JavaScript VMs are well-known examples).
|
| The VM part more properly refers to things like data
| representation and memory management. Interpreters of an
| abstract syntax tree, interpreters of bytecodes, and just in
| time compilers all need to run inside a system that takes
| care of loading code, allocating memory when requested, and
| often doing garbage collection. These services are what make
| a VM. The exact execution strategy is not what determines VM-
| ness.
| nethunters wrote:
| Thanks for clearing that up.
|
| Which languages or implementations of languages directly
| interpret the AST without the intermediary bytecode
| compilation?
|
| I know Python, Java and JavaScript (V8 and SpiderMonkey) all
| compile to bytecode first probably to speed up subsequent
| runs and run some optimisations on the bytecode.
|
| What other benefits are there to compiling to bytecode first
| vs directly interpreting the AST?
| smueller1234 wrote:
| Perl is a tree-based interpreter. It's not exactly an AST
| anymore by the time it's being executed, but close enough.
| herohamp wrote:
| Portability. Say I wanted to make language X run on all
| platforms, but I didn't actually care about compiling it on
| all platforms. I can just write a relatively simple VM for
| each platform. This is one of the reasons Java was and
| still kinda is so ubiquitous
| nethunters wrote:
| Wouldn't writing an interpreter for each platform be less
| work and achieve the same goal as writing a VM for each
| platform?
|
| Edit: ^Aside from being able to execute the bytecode on
| any platform
| kaba0 wrote:
| As others mentioned, you would have to distribute source code
| that way, and I think creating a simple VM is easier than
| a simple language parser. But of course, an optimizing
| one can be really quite complex in either case.
| klyrs wrote:
| Why would it be less work? The interpreter will need to
| implement whatever operations a VM can perform, so a
| priori it's _at least as much_ work. Bonus, if you can
| bootstrap the source->bytecode process, then you only
| need to write (and compile) that once to get a full-
| fledged interpreter on every host with a VM
| CalChris wrote:
| If you compile to AST and walk that then your portability
| is at the source level; you have to send the source over
| the wire; each target then needs a parser+walker and each
| target needs to parse+walk. If you compile to bytecode you
| can send bytecode over the wire and then simply interpret
| that bytecode.
| jhgb wrote:
| Gambit Scheme, CLISP, CMUCL are capable of interpreting
| ASTs, and I believe (although I'm not 100% sure) that this
| is the case for Lispworks and Allegro CL as well. Also,
| Ruby until version 1.8.
| IainIreland wrote:
| One major benefit of compiling to bytecode first is that
| bytecode is a more convenient shared source of truth.
|
| For example, SpiderMonkey has two interpreter and two
| compiler tiers. The output of the parser is bytecode. We
| can interpret it immediately, and it's a convenient input
| to the compilers. It also simplifies transitions between
| tiers: for example, when we bail out of Warp, we resume at
| a specific bytecode offset in the baseline interpreter.
|
| I'm not sure how you would resume at a particular point in
| the execution of an AST-based interpreter without agreeing
| on a linearized traversal of the AST, which is 90% of the
| way to bytecode.
| quelltext wrote:
| Keep in mind that bytecode interpreted languages (Ruby,
| Python) are typically called interpreted languages. Java is
| usually called "compiled" because of the explicit step vs.
| Ruby and Python, but it's essentially the same. And typically
| you'll find discussions wrt to JVM about interpretation in
| Java referring to bytecode interpretation vs. compiling being
| JIT compiling.
|
| Ultimately limiting interpreters to AST interpreters is not
| quite correct. The AST is just another IR that source code
| needs to be converted to just like bytecode. And the AST is
| essentially also executed by a virtual machine.
| Interpretation of the IR (the AST, or bytecode, etc.) is one
| part of a VM. Of course in some instances the VMness of a
| runtime is more pronounced (Java) than in others (Ruby).
|
| The difference between interpretation and compilation is that
| the latter produces output meant to run directly on real
| hardware, while the former implies something that executes the
| IR of a program by dynamically choosing which machine
| instructions to run.
|
| Of course a compiler is also something that takes in some
| code and outputs some other, typically more low level
| representation.
|
| My point being I don't think there is a strict or consistent
| definition these days for what an interpreter is.
|
| Case in point: I've also heard people say interpreters read
| the code "line by line" (or rather, not that but more
| accurately, as far as they need to read before they know what
| to execute next) and execute it piece by piece. Which might
| be how some interpreters worked in some languages in the
| past. An AST interpreter OTOH already implies some whole
| source file pre-processing. Is an AST interpreter then less
| of an interpreter than one that streams the source code line
| by line? Is a program that does that more of an interpreter than
| another which goes a step further and, e.g. executes a
| simplified AST, or one that executes a straightforward
| trivial bytecode representation of the simplified AST?
| CodeArtisan wrote:
| A virtual machine executes instructions while an interpreter
| takes source code as input. An interpreter could be built on
| top of a virtual machine, obviously, but that's not necessary. For
| example, SICP/PLAI/OOPLAI[1] explain how to implement an
| interpreter on top of Scheme where you directly interpret the
| s-expressions. These may be worth a read if you want to learn
| about the usual concepts and techniques used in programming
| language implementations from a more general point of view.
| Like, for example, how to implement a static type system or an
| object model based on classes. Interpreters based on a virtual
| machine are actually compilers on the fly; the source code is
| not interpreted but translated into instructions beforehand.
|
| [1] http://sarabander.github.io/sicp/
|
| http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04...
|
| https://users.dcc.uchile.cl/~etanter/ooplai/
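|
| To make the direct-interpretation point concrete, here is a
| minimal evaluator in the SICP style, sketched in Python rather
| than Scheme (the supported forms and names are only
| illustrative):
|
|       def evaluate(expr, env):
|           if isinstance(expr, str):        # variable reference
|               return env[expr]
|           if not isinstance(expr, list):   # number or other literal
|               return expr
|           op, *args = expr
|           if op == "if":                   # (if test then else)
|               branch = args[1] if evaluate(args[0], env) else args[2]
|               return evaluate(branch, env)
|           if op == "lambda":               # (lambda (params) body)
|               params, body = args
|               return lambda *vals: evaluate(
|                   body, {**env, **dict(zip(params, vals))})
|           fn = evaluate(op, env)           # application
|           return fn(*[evaluate(a, env) for a in args])
|
|       env = {"+": lambda a, b: a + b, "<": lambda a, b: a < b}
|       prog = ["if", ["<", 1, 2],
|               [["lambda", ["x"], ["+", "x", 10]], 5],
|               0]
|       print(evaluate(prog, env))  # 15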
| iTokio wrote:
| By the author of the excellent https://c9x.me/compile/
|
| A lightweight alternative to llvm
| https://c9x.me/compile/doc/llvm.html
| dmitriid wrote:
| I'm not an expert by far, but I feel like many people who write
| compilers or develop languages no longer recommend The Dragon
| Book as it's very outdated.
|
| Additionally, the vast majority of compiler books and resources
| spend too much time on the easiest part of compiling: parsing.
| It's not uncommon for a book or a course to dedicate 60-70% of
| its time to parsing, leaving little for everything else.
| MaxBarraclough wrote:
| Yep, here's a thread from two years ago on exactly that:
| https://news.ycombinator.com/item?id=20436399
| rualca wrote:
| > Additionally, the vast majority of compiler books and
| resources spend too much time on the easiest part of compiling:
| parsing.
|
| I disagree. Parsing is the single most important feature of
| building a programming language. It influences how the language
| can be structured and also how it should be structured for
| performance reasons.
|
| Also, you're far more likely to write a parser for a language
| than implement anything remotely related to compilers and
| interpreters.
|
| It's one thing to state that some topics might justify more
| content, but it's absurd to downplay the fundamental importance
| of parsers in putting together a compiler.
| tom_mellior wrote:
| The problem with the old old Dragon Book's treatment of
| parsing is that it explains a particular impractical
| algorithm (LR(1)? LR(k) maybe? it's been a while) in way too
| much detail. You would need that detail if you were to
| implement a parser generator (for that specific algorithm),
| or a parser using that specific algorithm, which isn't
| terribly well suited for generating useful error messages for
| your users.
|
| Today's computers are ridiculously powerful, and the
| perceived performance advantages of LR parsing are simply
| irrelevant. Further, even if you were to implement an LR
| parser, you would do it with the aid of a parser generator,
| which hides the boring details and saves you from needing to
| understand the LR algorithm itself. But more likely when you
| write a parser you use a parser generator using a different
| approach, or you write a recursive descent parser by hand. In
| either case, the dry parts of the old old Dragon Book are
| completely irrelevant.
|
| The people criticizing the 1986 Dragon Book's treatment of
| parsing aren't saying that parsing is irrelevant. They are
| saying that parsing _that specific way_ is either irrelevant
| or done for you by a tool that is not the interesting part of
| a compiler.
| jhgb wrote:
| > I disagree. Parsing is the single most important feature of
| building a programming language. It influences how the
| language can be structured and also how it should be
| structured for performance reasons.
|
| Wirth already settled that question (of both structuring the
| language and performance) with recursive descent over thirty
| years ago. The necessary treatment of all things parsing is
| then covered on twenty pages of his Compiler Construction
| text.
| rualca wrote:
| > Wirth already settled that question (of both structuring
| the language and performance) with recursive descent over
| thirty years ago.
|
| "Recursive descent" is a whole class of parsing algorithms.
| Claiming that recursive descent settled the question is a
| kin to claim that compiled programming languages settled
| the question of how to develop programming languages.
|
| Moreover, even if you go as far as believing that a
| specific class of parsing algorithms settled anything, it's
| absurd to argue that people who are learning how to write a
| parser should not learn about them, specially as your
| personal choice has fundamental limitations such as left-
| recursice operations and very specific techniques to
| mitigate them.
|
| And finally, we still see articles being accepted into
| academic journals on how to parse concrete languages such
| as JSON. I'm sure we can agree that slapping together
| something with a random parser generator is not something
| that serves the students' best interests.
| jhgb wrote:
| It turns out that whatever limitations there are to that
| approach hardly matter in practice, since Wirth developed
| a wide variety of languages this way, and they work just
| fine.
| rualca wrote:
| > It turns out that whatever limitations there are to
| that approach hardly matter in practice (...)
|
| How do you know if you do not learn about parsers?
|
| Do you see where I am going with this?
|
| It's absurd to claim that parsers are not fundamental and
| central to developing compilers. That statement is patently
| false, especially considering that some programming
| languages are notoriously challenging to parse.
| jhgb wrote:
| > That statement is patently false, especially considering
| that some programming languages are notoriously
| challenging to parse.
|
| Sure, if you create an artificial problem in the first
| place, you need an artificial solution for it. That's why
| careful language design is even more important than
| parsers.
| HelloNurse wrote:
| If parsing "influences how the language can be structured and
| also how it should be structured for performance reasons"
| it's probably going to be a mediocre language, in which self-
| imposed constraints interfere with useful special features
| resulting in a bland design lacking competitive advantages.
|
| Pascal has already been cited in other comments as having
| been designed for ease of parsing, but the Turbo Pascal
| compiler is an isolated masterpiece, and high-quality
| compilation at decent speed is normally more useful than
| cheap compilation at ludicrous speed.
|
| Maybe parsing seems a large part of constructing a compiler
| because it's naturally the first task and because it's likely
| to be entangled with designing the syntax of the language,
| which can pose difficulties (e.g. ambiguities, context
| sensitivity, messy operator precedence).
| dmitriid wrote:
| > Parsing is the single most important feature of building a
| programming language
|
| It really isn't. Also, it's the most trivial part of
| implementing a programming language (in a compiler).
|
| > It influences how the language can be structured and also
| how it should be structured for performance reasons.
|
| Once again, no, not really. However complex your language is,
| it can be parsed at 100 GB/s on a Raspberry Pi. Even if you
| do it with regexps. Everything that comes _after_ parsing is
| very complex and is rarely explained in any detail. And
| everything that comes _after_ actually _directly_ influences
| the structure, the performance etc.
|
| It really makes zero difference if you decide that your
| typing info should be written like this:
|
|       T x = new T();
|
| or like this:
|
|       let x: T = T()
|
| or like this:
|
|       var x := T()
|
| All this is trivial to parse. However, deciding on how you
| implement typing, what needs to be in the language to
| support your choice etc., now _that_ definitely influences
| the structure, performance, trade-offs etc.
|
| How much of that is in the Dragon Book? Next to zero? But
| surely there are 4 chapters on how to parse this.
| rualca wrote:
| > It really isn't. Also, it's the most trivial part of
| implementing a programming language (in a compiler).
|
| No, not really. I mean, think about it for a second: while
| you cannot get a compiler out of the door without a parser,
| you can indeed leave all the fancy optimization features
| out, can't you? And if you want to get an MVP out of the
| door, do you focus your effort on developing optional
| features or on the basic, critical part that ensures that
| your language exists?
|
| And then there's developer experience. Syntax/grammar
| errors? You need the parser for that, don't you agree?
|
| > Once again, no, not really. However complex your language
| is, it can be parsed at 100 GB/s on a Raspberry Pi.
|
| Actually, you're quite wrong. RapidJSON only handles a very
| basic language, doesn't need to recover from weird errors or
| output meaningful error messages, and it boasts at most 1 GB/s.
|
| And this is a parser optimized for speed.
|
| Perhaps if more people had a clue about compilers, these
| sorts of errors and misconceptions wouldn't be so common.
| estebank wrote:
| > while you cannot get a compiler out of the door without
| a parser, you can indeed leave all the fancy optimization
| features out, can't you? And if you want to get an MVP out
| of the door, do you focus your effort on developing
| optional features or on the basic, critical part that
| ensures that your language exists?
|
| > And then there's developer experience. Syntax/grammar
| errors? You need the parser for that, don't you agree?
|
| These two statements are at odds with each other: having
| great developer experience is itself a "fancy optimization"
| that you shouldn't spend too much time on when starting a
| project (and I say that as someone who spends every day
| thinking about making compiler errors better). Part of the
| reason the literature around parsers is problematic is that
| graceful error recovery isn't explored in detail.
|
| > And this is a parser optimized for speed.
|
| Parse time is never the bottleneck. It is a waste of time
| to focus on optimizing parse speed at the beginning. It
| is worth it to eventually measure and tune it, and to
| think about the implications when designing your grammar,
| but beyond that the slowness of compilers is never driven
| by their parsers.
|
| > you cannot get a compiler out of the door withot a
| parser
|
| Technically, you can: you make your compiler something
| that consumes an already existing bytecode (like the
| JVM's or .Net's CLI) or only implement an AST parser that
| isn't intended for humans to actually write while you
| nail down the specifics.
|
| > Syntax/grammar errors? You need the parser for that,
| don't you agree?
|
| Error recovery in a parser is a tricky thing: it is
| intimately tied to the choices you've made in your
| grammar. The more redundancy and uniqueness between
| scopes you can have, the easier it can be for your parser
| to identify where things have gone badly. For example, in
| Python, if you write
|
|       if foo = bar:
|           print(foo)
|
| you have little information but can still infer that you
| likely wanted to compare equality using == instead. But
| because Python has a very flexible grammar and semantics,
| the following is valid but likely not what was intended:
|
|       class Foo:
|           def bar(self):
|               print("hi!")
|
|       foo = Foo()
|       foo.bar = "hi!"
|
| If you make different more restrictive decisions on both
| grammar and semantics you can still provide tools to
| allow people to do this, by being more verbose, but
| people who do this by accident can be given excellent
| diagnostics.
| dmitriid wrote:
| > I mean, think about it for a second: while you cannot
| get a compiler out of the door withot a parser, you can
| indeed leave all the fancy optimization features out,
| don't you? And if you want to get a MVP out of the door,
| do you focus your effort in developing optional features
| or the basic, critical part that ensures that your
| language exists?
|
| Exactly: _an MVP_. For an _MVP_ the parser may indeed be
| "the single most important feature of building a
| programming language". But even then it almost
| definitely isn't.
|
| The moment you "leave out fancy optimisation features"
| your language becomes at most mediocre.
|
| Because between parsing (which is trivial) and "fancy
| optimisation features" (which come very late in a
| compiler life) there are a million things that are
| significantly more important than parsing:
|
| - how do you do typing?
|
| - how do you do code generation (do you output machine code
| yourself or use a backend)?
|
| - do you use an intermediate representation?
|
| - what are the actual optimisations on performance (none
| of which have anything to do with parsing)? As an
| example, C++'s horrible header system has nothing to do
| with parsing, but slows compiling by orders of
| magnitude.
|
| - how do you do graceful error recovery and reporting?
|
| - how are lambdas handled? how are scopes handled? memory
| layouts? GC/ARC/manual memory management? standard lib?
|
| - a million other things
|
| > Actually, you're quite wrong.
|
| It was hyperbole. Do you even understand how fast 1 GB/s
| is? It's roughly 10 to 20 million lines of code per second
| [1]. Windows 10 is roughly 50 million lines [2]. So, at
| 1 GB/s you can parse the _entire Windows 10 codebase_ in
| under 5 seconds.
|
| Even if your parsing speed is 1000 times slower (3 orders
| of magnitude slower), you can still parse ~10k lines a
| second. And I honestly can't even begin to think how you
| could parse something this slowly. And if you do, it's
| still not a concern, because the rest comes from outside
| parsing, for example:
|
| - can you run parsers in parallel?
|
| - do you need to parse the entirety of your project to
| begin compilation, or your files/modules are isolated and
| compiler can work on any of those?
|
| Parsing is _trivial_.
|
| [1] For example, this SO answer for an unrelated question
| creates a file with 150 million lines, and the result is
| 6GB. 150/6 = 25, I went for even lower numbers.
|
| [2] https://answers.microsoft.com/en-
| us/windows/forum/all/window...
| tom_mellior wrote:
| > I'm not an expert by far, but I feel like many people who
| write compilers or develop languages no longer recommend The
| Dragon Book as it's very outdated.
|
| Yes. There are newer editions that I can't judge, but the 1986
| edition linked from the page should nowadays only be read by
| historians. If you want to implement a compiler, look
| elsewhere, for something newer.
|
| This isn't to say that this was a bad book 35 years ago. It
| wasn't. But it is simply outdated. It does not contain things
| that are important today. It has very little treatment of type
| systems and type inference, for example. Its treatment of data
| flow analysis is based on intervals, which have limitations
| (though also some interesting theoretical advantages) and are
| used by zero compilers written in this millennium (and not even
| described very well in the Dragon book IIRC). And then of
| course there is the parsing problem, see my rant elsethread.
|
| As for the article, I think A Catalogue of Optimizing
| Transformations
| (https://www.clear.rice.edu/comp512/Lectures/Papers/1971-alle...)
| would make an interesting addition.
|
| Edit: Forgot to add that the old old Dragon book doesn't cover
| two topics that the article itself considers very relevant: SSA
| form (which I believe was discovered just around that time) and
| linear scan register allocation (which is much newer).
| aarchi wrote:
| What would you recommend that covers data flow analysis
| better than the dragon book? I'm currently starting data flow
| analysis for my own compiler using intervals and/or modular
| congruences (operations can be reduced to congruences, then
| combined with the Chinese Remainder Theorem). What are the
| theoretical advantages of intervals and what do you recommend
| over that?
| tom_mellior wrote:
| Ah, sorry, I should probably clarify that there are two
| completely distinct uses of the term "interval" in data
| flow analysis.
|
| One (introduced in https://amturing.acm.org/p137-allen.pdf
| and discussed in the old Dragon Book) groups _program
| points_ into nested "intervals", i.e., intervals are
| pieces of code. Analysis information is then propagated
| among these intervals. This is what I was referring to
| above. As far as I know this approach is completely
| obsolete by now, with everyone using methods based more or
| less indirectly on Kildall's worklist iteration approach
| (http://static.aminer.org/pdf/PDF/000/546/451/a_unified_appro...).
| The theoretical advantage I was referring to above
| was that the interval approach is AFAIR sometimes more
| efficient because it can take program structure (loop
| nesting) into account more directly.
|
| The _other_ meaning of "intervals" is used for an
| _analysis domain_ , i.e., the kind of data you want your
| analysis to compute. For example, you might determine that
| the possible values of some variable x are in the interval
| [10, 20], the values of y are in the interval [5, 15], and
| then you can compute that the values of z = x + y must be
| in [15, 35]. There is nothing wrong with this approach, and
| any optimizing compiler is pretty much forced to use
| something like this. Based on your reference to modular
| congruences I think you probably want to do analysis of
| numerical values and meant this sense, which is anyway the
| only one in common use nowadays. Go ahead and do this.
| Extensions are with congruences, like "x is in [10, 20] and
| also congruent to 1 modulo 4" and/or "known bits", like "x
| is in [10, 20], and I know that (counting from the LSB) its
| bit 2 is 0 and bit 3 is 1".
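|
| (For what it's worth, the transfer functions for a plain
| interval domain are tiny; here is a sketch of the x + y example
| above, ignoring widening and overflow.)
|
|       # An abstract value is a closed interval (lo, hi).
|       def interval_add(x, y):
|           return (x[0] + y[0], x[1] + y[1])
|
|       def interval_join(x, y):  # used where control flow merges
|           return (min(x[0], y[0]), max(x[1], y[1]))
|
|       x = (10, 20)
|       y = (5, 15)
|       print(interval_add(x, y))  # (15, 35), as in the example above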
|
| Personally I learned about data flow analysis in a
| university course based on Principles of Program Analysis:
| https://www.springer.com/gp/book/9783540654100. It's a book
| that goes into a lot of mathematical depth but not much
| breadth of analyses, and it doesn't handle data flow
| analysis on SSA form. Then I learned additional details
| from various well-known papers like the ones linked above,
| and the classical early papers around analysis on SSA form,
| like https://www.cse.wustl.edu/~cytron/531Pages/f11/Resources/Pap....
| I'm sure there must be a more straightforward
| text to get you there, but I don't think I can give a
| definite recommendation. I have vague memories of the
| coverage in Muchnick's "Advanced Compiler Design and
| Implementation" and/or Cooper and Torczon's "Engineering a
| Compiler" being reasonable, but I can't find my copies
| right now.
| Rochus wrote:
| > no longer recommend The Dragon Book as it's very outdated
|
| Only if they didn't realize there is a second edition from
| 2006.
| testrvb wrote:
| test
___________________________________________________________________
(page generated 2021-04-24 23:00 UTC)