[HN Gopher] A toy programming language in 137 lines of Python code
       ___________________________________________________________________
        
       A toy programming language in 137 lines of Python code
        
       Author : miguelgrinberg
       Score  : 77 points
       Date   : 2023-07-02 15:36 UTC (7 hours ago)
        
 (HTM) web link (blog.miguelgrinberg.com)
 (TXT) w3m dump (blog.miguelgrinberg.com)
        
       | mr_00ff00 wrote:
       | Great article. I've always wanted to try and build a toy language
       | for the fun of it, great to see something that breaks it down so
       | well.
       | 
       | Curious if anyone has any additional thoughts to add onto this
       | for anyone who wants to build one themselves?
       | 
       | I have experience in C++, so will likely use that or rust (I
       | imagine rust's built-in methods might make it way easier at least
       | for the parser, although idk if ownership rules get painful for
       | work like this).
        
         | benhoyt wrote:
         | Yes, my thought is to read Bob Nystrom's "Crafting
         | Interpreters". It's excellent, in-depth but not academic, and
         | the writing style is great. https://craftinginterpreters.com/"
         | He also has a lot of good thoughts about language design.
        
       | 29athrowaway wrote:
       | Once you have an intuition, you can use yacc.
        
         | laxd wrote:
         | And lose touch with your code and having too deal with problems
         | through a heavily abstracted layer. Great!
        
           | 29athrowaway wrote:
           | You will likely not get a perfect syntax right away. So you
           | will need to iterate on it until you are happy with how it
           | looks and feels, and doing it while developing the compiler
           | yourself will naturally lead to a lot of unnecessary bugs and
           | tech debt.
           | 
           | Alternatively, you iterate using a cheaper method and once
           | you reach a point in which you are content with the syntax,
           | you can consider writing a compiler for it.
           | 
           | You can also avoid iterating on the syntax so you can ship a
           | compiler as fast as possible, but users won't like the syntax
           | and won't use your language.
        
             | mcluck wrote:
             | I have a feeling you haven't worked on parsers or compilers
             | that much if you think you design all of the syntax before
             | building anything
        
           | remexre wrote:
           | I don't really think of yacc as having much more abstraction
           | than a regex engine -- like in an (NFA-based) regex engine,
           | you're building a big data structure a human wouldn't want to
           | hand-engineer from a declarative specification. (And, like in
           | regexes, you've got good theory backing it.)
        
             | laxd wrote:
             | You got good theory backing hand-written parsers, and the
             | code is there right in your face. I don't know how a regex
             | engine works.
        
               | remexre wrote:
               | By "good theory," I mean being able to know properties of
               | the parser as a result of the formalism.
               | 
               | For example, one can prove that any yacc-generated parser
               | never infinite loops and runs in O(n) (at least for the
               | actual parsing; you can write custom code in actions that
               | runs when a construct has been parsed, and of course you
               | can write whatever you want there).
               | 
               | If you hand-write a grammar like the following, you'll
               | end up with an infinite loop without restructuring the
               | grammar; such restructuring is of course much easier if
               | you have written the grammar out, rather than having it
               | implicitly exist in code.                   Expr ::= Expr
               | '[' Expr ']'         Expr ::= Name         Expr ::= Int
               | 
               | If you try to naively parse the following, you'll never
               | parse the marked production either. Something like yacc
               | will warn you about this, while an ordinary compiler
               | compiling a hand-written parser won't.
               | Stmt ::= Expr ';'         Stmt ::= Name '=' Expr ';'
               | Stmt ::= Expr '[' Expr ']' '=' Expr ';'         Expr ::=
               | Expr '[' Expr ']'         Expr ::= Name         Expr ::=
               | Int
               | 
               | [0] is a good introduction to regex matching, [1] is a
               | good introduction to the algorithm yacc uses, if you want
               | to learn either.
               | 
               | [0]: https://swtch.com/~rsc/regexp/regexp1.html [1]: http
               | s://web.archive.org/web/20210507215636/http://web.cs.dal.
               | ..
        
               | laxd wrote:
               | > For example, one can prove that any yacc-generated
               | parser never infinite loops and runs in O(n)
               | 
               | Trivial when actually writing the code. And your example
               | of infinite left recursion is basics in parser theory. No
               | one in their right mind would write a hanging program
               | like that, at least not for long. The problem just seem
               | more clever when written in grammars.
        
               | remexre wrote:
               | Left recursion is basic, sure, but it's not a problem for
               | an (LA)LR parser generator like yacc. (Depending on your
               | parse actions, you might even get slightly better memory
               | consumption when using left recursion rather than right
               | recursion.)
        
         | miguelgrinberg wrote:
         | Knowing how to build a parser manually will make you much much
         | better at generating, and most importantly debugging,
         | automatically generated parsers.
        
           | 29athrowaway wrote:
           | Yes. And once you are done learning every aspect of it, one
           | day you will look yourself in the mirror and you will be 60
           | years old.
        
             | benj111 wrote:
             | Why do you think people are writing toy programming
             | languages?
             | 
             | It's a learning exercise. Sure, if you want to learn to use
             | yacc, or whatever, use yacc. I don't think time efficiency
             | is a useful metric when the overwhelming majority of people
             | is this space aren't doing it for their job, or even
             | because they 'need' to create a new language.
             | 
             | Ultimately you could just use python, it'll be faster and
             | more fully featured.
        
             | furyofantares wrote:
             | I'm not sure why this intuition is so strong, I had it
             | myself before learning this stuff in my early 20s on
             | shipping projects (both existing, and from scratch) and at
             | the end of that I was still in my early 20s.
             | 
             | Despite this intuition it's all actually very learnable and
             | my experience is that using a parser generator makes no
             | sense in a professional context and very little sense in a
             | hobby/toy project context.
        
         | 3cats-in-a-coat wrote:
         | Once you have an intuition, you just write the parser yourself.
         | 
         | I'm not a fan of parser generators. I find them restrictive. It
         | is also notable most mainstream languages do not use tools like
         | YACC, they have "handmade" parsers and compilers.
         | 
         | This may seem like I'm suggesting pie-in-the-sky approach from
         | the ground up for something that may never need such control.
         | Could be. But I don't feel like learning YACC is easier than
         | learning how to parse anything in general. I think YACC and so
         | on are holding people back from truly expressing themselves
         | with custom languages, and after all that was the whole point
         | of them writing a language or wasn't it?
        
           | 29athrowaway wrote:
           | So let's say, you decide to create your own programming
           | language, now you can automatically afford to do as much work
           | as the organizations behind mainstream programming languages?
           | Organizations with thousands of contributors, funding, etc.?
           | 
           | Now, let's omit that for a moment. Let's say that after
           | having implemented your language after years of work, now for
           | some reason you got 1 prospect user.
           | 
           | That person will ask you about: syntax highlighting, linting,
           | code navigation, testing, packaging, documentation, operating
           | systems/architecture support, etc. Who is going to contribute
           | all that if you are busy writing everything by hand?
           | 
           | Unless you have as much time as Terry Davis, you are better
           | off starting by piggybacking on something that exists. Then,
           | once you have an idea that scales, you can convince other
           | people to help you and have a successful project. Then you
           | can have a viable ecosystem that people can comfortably join.
        
             | laxd wrote:
             | .... "syntax highlighting, linting, code navigation,
             | testing, packaging, documentation, operating
             | systems/architecture support"
             | 
             | This has nothing to do with using a parser generator or
             | not.
        
               | 29athrowaway wrote:
               | It has to do with having time to spend in creating a
               | convenient development experience and picking your
               | battles.
        
               | laxd wrote:
               | That's agreeable. We seems to disagree on how to achieve
               | that ideal - on the specific point of whether to use a
               | parser generator or not. The correct answer is not :)
        
             | jstimpfle wrote:
             | The real problem about parsing is not parsing the formal
             | grammar. It's deciding what to store about the parsed input
             | (CST to AST) and designing the ancillary datastructures for
             | further processing, integrating the parser with the rest of
             | the application, etc.
             | 
             | The real problem about _building a compiler_ is everything
             | else but not parsing. Parsing is the easiest part. Once you
             | 've written a view recursive descent parsers you will
             | probably agree.
             | 
             | The parser for syntax highlighting is probably a different
             | one than the parser for AST generation and compilation. So
             | is a parser for batch compilation different to one for
             | incremental compilation. Even the _syntax_ might be
             | different.
        
             | 3cats-in-a-coat wrote:
             | Using YACC doesn't give you magical integration with IDEs,
             | operating systems, testing, documentation and so on. I feel
             | as if you took my specific comment about parsers (which are
             | not hard to write at all, few hundred lines of code for a
             | moderately complex language) in an inappropriately general
             | way as in "don't use anything for anything when doing
             | anything".
        
           | Tade0 wrote:
           | I'm not well versed in this space. Are you suggesting
           | designing a language without specifying a formal grammar?
        
             | jstimpfle wrote:
             | One does not preclude the other. Most languages that lend
             | themselves to hand-parsing via recursive descent are
             | probably LL(1).
             | 
             | Syntaxes that are easy to parse via recursive descent are
             | probably also the ones that are easiest to parse for
             | humans.
        
       | tomp wrote:
       | short code review:
       | 
       | - the `tokens` method returning a variable-length tuple is a very
       | bad idea; modern Python supports the `datatype` decorator which
       | should be used in this case
       | 
       | - I'd strongly recommend using a regex for lexing; much easier to
       | get it right than a long if/elif/else block
       | 
       | - I'd recommend refactoring code and implementing the
       | `next_token` logic so that it works for _any_ generator; an
       | example implementation could be (WARNING: I wrote this in 5min,
       | would require extensive unittests!)                   class Peek:
       | def __init__(self, it):             self.cur, self.nxt =
       | itertools.tee(it, 2)             self.current = next(self.cur)
       | def __next__(self):             self.current = next(self.cur)
       | return next(self.nxt)
       | 
       | - statements with a "prefix" (like `print X` or `if Y`) are very
       | easy to parse (just use the lookahead token!) but when you get to
       | parsing expressions, I strongly recommend using a Pratt parser
       | (extensible operator precedence parsing)
        
         | benhoyt wrote:
         | > modern Python supports the `datatype` decorator
         | 
         | I presume you mean the @dataclass decorator?
        
           | tomp wrote:
           | indeed :D
        
       | intalentive wrote:
       | I love Python but after taking PL in Racket (a Lisp-like
       | language) it's hard to imagine implementing toy languages
       | otherwise. Seems like Python 3.10's match statements might come
       | in handy.
        
       | BoppreH wrote:
       | This is a really fun exercise, I recommend every programmer try
       | it once. You can replace most of the tokenization code with
       | re.Scanner[1], which also allows you to have strings without
       | worrying about `code.split()` messing them up.
       | 
       | [1] https://news.ycombinator.com/item?id=36517749
        
       | Fraterkes wrote:
       | This seems great. When I read these kinds of explanations, it
       | always strikes me how the way you're supposed to write a
       | programming languages is basically the same way someone who had
       | never written one might do it. Just break the input up into
       | units, map the units to some category, and connect functionality
       | to that. Like if you forced someone who had 2 weeks of
       | programming experience to come up with a way to turn code into
       | programs, they would probably think of something like the
       | lexer/parser/evaluater process if you gave them enough time
        
       | gabrielsroka wrote:
       | Actual title "Building a Toy Programming Language in Python"
        
         | codetrotter wrote:
         | I think using a different title on HN is fine when OP is the
         | author, such as here
        
           | gabrielsroka wrote:
           | I noticed that after I posted it. But I still have a "web"
           | button in my HN browser, and it doesn't work unless the title
           | is correct.
           | 
           | Of course, that author can change the title on the website as
           | well.
        
       | frgfm wrote:
       | Really great content, and an interesting exercise! A while back,
       | I wanted to do something similar but built on top of a low-level
       | language to get good performances on a specific use case. I'd
       | love to see an article about something built on top of Rust or C
       | (like a retrospective analysis on success/mistake of Python over
       | the years)
        
       ___________________________________________________________________
       (page generated 2023-07-02 23:00 UTC)