https://cs.lmu.edu/~ray/notes/languagedesignnotes/

Language Design

So, you want to design your own language? Of course you do. Or
perhaps you are taking a class and are being forced to create a
programming language under penalty of a bad grade. What kinds of
things do you need to know?

Prerequisites

It helps to be experienced. But it's okay if you're not--you can get
lucky, too!

You should have a good sense of:

  * Programming Paradigms: imperative declarative structured 
    object-oriented functional applicative concatenative logic 
    protocol-oriented aspect-oriented array event-driven dataflow 
    agent-based etc.
  * Programming Language Concepts: Sequencing, conditional execution,
    iteration, recursion, functional decomposition, modularity,
    synchronization, metaprogramming, binding, scope, extent,
    volatility, subtyping, pattern matching, type inference,
    closures, prototypes, introspection, instrumentation,
    annotations, decorators, memoization, traits, streams, monads,
    actors, mailboxes, comprehensions, continuations, wildcards,
    promises, regular expressions, proxies, transactional memory,
    inheritance, polymorphism, parameter modes, etc.

Study Existing Languages

It's nice to have a feel for a variety of languages. Here are a few
that are good to be familiar with (and the reasons why they are good
to study):

  * Python, for basic imperative programming and scripting
  * Smalltalk, for OOP done beautifully, and Ruby for copying it,
    sort of, ...
  * JavaScript, for event-driven features, async, and promises
  * CoffeeScript, for ways to make syntax concise
  * Io, for ideas on building up everything from a minimal semantics
  * Julia, for multiple dispatch
  * Clojure, Racket, Scheme, and Common Lisp, for Lispiness (macros,
    etc.)
  * Standard ML, OCaml, F#, and Elm, for Hindley-Milner typing and
    more
  * Haskell, for typeclasses and functional purity
  * eToys, Scratch, Snap!, and GP for the blocks approach
  * Erlang and Elixir, for expressing concurrent processes
  * Java and C#, for being enterprisey
  * Kotlin, Ceylon, and Scala, as examples of evolving Java
  * PureScript, TypeScript, and ClojureScript, as examples of
    evolving JavaScript
  * C++ and Rust, for pointers and other system constructs
  * Go and Swift, for more examples of modern, but generally
    mainline, ideas
  * J and K, for array programming
  * Idris, for dependent typing
  * Prolog and Mercury, for logic programming
  * Factor, for concatenative programming
  * Quipper, because it's a language for quantum computing
  * Brainfuck, Malbolge, LOLCODE, Whitespace, and other classic
    esoterics
  * GolfScript, CJam, Pyth, Jelly, and other golfing languages,
    because this gives you a good sense of putting power into small
    syntactic constructs
  * Piet and Hexagony, for, well, check them out....

Also check out this mini-encyclopedia of 70 languages.

Here are some excellent cross-language comparisons that help you to
hone your understanding of how different syntaxes can express the
same idea:

  * Syntax Across Languages
  * Wikipedia Programming Language Comparisons
  * Hyperpolyglot

These are really good too:

  * Showcase of Languages
  * Languages Through the Years

Don't Forget The Literature

Remember, many people have designed languages before you. They made
mistakes. They came up with brilliant ideas. Many were wildly
successful. Some never made it big. Some people have brought in years
of research on how people think and learn to come up with principles
for language (and environment) design.

You should learn form their experiences. Study classic papers. Read
web essays. Here is a small sampling:

  * Hints on Programming Language Design by Tony Hoare
  * Learnable Programming by Bret Victor
  * Five Questions about Language Design by Paul Graham
  * For early classic papers, check out the massive bibliography of
    this paper
  * Also look at the papers from Kathleen Fisher's Programming
    Langauge Design class

Think about the future:

Sketch and Prototype

Ready to strike out on your own? Do the following, in no particular
order:

  * Understand the audience that the language is designed for, and
    what kinds of things they want to create with it, or problems
    they want to solve with it.
  * Sketch programs or program fragments in your language.
  * Sketch fragments that are not in your language.
  * Come up with a list of semantic capabilities, or features. Make
    sure they enable programmers to express their creations by
    following the suggestions and principle in the Learnable
    Programming essay, including:
      + Make meaning transparent
      + Explain in context
      + Make flow and time tangible and visible
      + Show the data, and avoid hidden state
      + Use metaphor
      + Allow easy decomposition and recomposition
      + Allow the programmer to write readable code--you do support
        mentioning parameter names in calls, right?
  * Come up with a nice, consistent, syntactic theme.

Then:

  * What will you call your language, and why?

Remember that design is an iterative process, and creativity is
enabled and enhanced with immediate feedback. So you should design
with tools that allow you to experiment and test your ideas.

You should use the Ohm Editor.

Use the Ohm Editor

A great tool for experimenting with programming languages is the Ohm
Editor.

ohm-editor.png

In the upper left panel, design your grammar. You can load/save from
your browser's local storage, and even publish gists to GitHub. In
the upper right panel, enter test cases: both tests you want to
succeed (thumbs up) and those you want to fail (thumbs down). The
bottom panel is an interactive concrete syntax tree for the currently
selected test case.

This tool will save you a lot of time.

Things To Think About

Okay I'm just going to rattle off a bunch of this that come into my
head that you may or may not want to think about when you design your
language:

  * Are you trying to make a reasonable, usable language or an
    esoteric/joke/golfing/weirdo language?
  * Do you have a specific audience in mind? Artists? Graphic
    designers? AI researchers? Numeric and Scientific nerds? Natural
    language types? Game developers? Machine learning people?
    Animators? High performance folks? System programmers? Or do you
    want a general purpose language? Or do you just want to do what
    you want?
  * Do you want to be pragmatic, idealistic, researchy, or evil?
  * Do you want your language firmly in one camp--OO, functional,
    logic, concatenative, plain-imperative--or be a multiparadigm
    symphony? Or a multiparadigm cacophany?
  * Do you want a language built on a single characteristic building
    block (a crystallization of style) or one with a huge variety of
    syntactic forms for multiple semantic aspects (an agglutination
    of features)?
  * What is your scheme for showing structure? Curly-brace
    (JavaScript, Java, C++, C#), Terminal-end (Ruby, Ada), Nested
    parentheses (Lisp, Clojure, Racket), Indentation (Python,
    CoffeeScript), Blocks (EToys, Scratch, Snap!), Pictures (Piet),
    Other (Haskell, Erlang, Prolog)?
  * If your language supports functions:
      + Must you pass the exact number of arguments the function
        expects?
      + Can you specify parameter names in the call?
      + Do you have rest parameters? optional parameters? keyword
        parameters? default parameters?
      + Are the parameters typed?
      + Can parameters be modified? Are they passed by value,
        value-result, reference, name, etc?
      + Are arguments evaluated L-R, R-L, in parallel, or in
        arbitrary order?
      + How does a function return a value? Can it return zero or
        more values or just one?
      + Are functions first-class? If so, do you use deep binding or
        shallow binding?
      + Is recursion allowed?
      + Can you test functions for equality?
      + Can functions be anonymous?
  * What is your type system like?
      + Static vs. dynamic, strong vs. weak, manifest vs. implicit?
      + Do you distinguish primitive types from reference types?
      + Do you have all those different numeric types based on bit
        size? What about decimals and ratios?
      + Do you have a separate character type? A separate boolean
        type?
      + Can you get the type of an expression at runtime?
      + Are types objects?
      + Do you have supertypes and subtypes? Multiple-inheritance? If
        so, how do you handle name collisions?
      + Are types the same as classes? Can you add new types?
      + Do you have both sum and product types?
      + Do functions have a type?
      + Are there pointer types?
      + Are there parameterized types? If so, are they (mostly)
        covariant, contravariant, or invariant?
      + Any dependent types?
  * What do your expressions look like?
      + And does your runtime follow eager or lazy evaluation? Or
        both?
      + Do you have only prefix, only postfix, only infix, or a wild
        mix of operators?
      + Can you overload operators? If so, how?
      + Can you change the precedence of operators? Even the built-in
        ones? At runtime?
      + How much type inference do you have?
      + Can you mark variables as mutable or immutable?
      + Are your variables bound or assigned? Can they be assigned
        more than once?
      + Do you have destructuring and/or pattern matching?
      + How do you determine scope? Do you implicitly or explicitly
        import into inner scopes?
      + Are there keywords that define access like public, private,
        and protected, or are there conventions, like in Go, where
        capitalized entities are implicitly exportable and
        lower-cased entities are private?
      + Is there a let-expression for super-local scopes?
      + Is shadowing allowed?
      + Do you have anything like JavaScript's this expression?
  * How do you express control flow?
      + Do you like expression-orientation or do you have real
        statements?
      + How do you express sequential flow vs. nondeterministic flow
        vs. parallel flow?
      + If you have nondeterminism, how do you ensure fairness?
      + Must guards in a multiway selection be executed in any
        particular order?
      + Do you have short-circuit operators? Iterator objects?
        Generators?
      + Do you have all the loops: (1) forever, (2) n times, (3)
        while/until a condition, (4) through a numeric range (with an
        optional step size), or (5) through a collection?
      + Do you have break and continue? Anything like Ruby's retry
        and redo?
      + Do you have exceptions? Or do you need Go-style constructs?
        Or do you have nullables everywhere?
      + Do you have a timer for sleeping, delaying execution, or
        running on intervals?
      + Can I haz goto?
  * How do you support concurrency?
      + Threads, or events?
      + Shared memory only? Message passing only? Or both? If shared
        memory, is it mutable? If so, do you have locks, higher-level
        synchronization devices, or some kind of transactional
        memory?
      + Do you support different levels of granularity of
        concurrency?
      + Are tasks spawned implicitly when their enclosing block
        begins, or do they require explicit invocation?
      + Do tasks die when the enclosing block or spawning task
        terminates? Or does the spawner wait for all internal tasks
        to terminate?
      + Do taks know who started them?
      + If an asynchronous task is launched, how is a result
        obtained? Callback, promise, or some other mechanism?
      + Is message passing done via named channels, named processes,
        or bulletin board?
      + Is message passing always synchronous, asynchronous, or both?
      + Is there timed or conditional synchronization mechanisms?
      + Can tasks detect their own state?
  * How meta are you?
      + Can a program get a list of its global variables, loaded
        classes, top-level functions, etc.? Can a class get a list of
        its fields, methods, constructors, superclass(es),
        subclasses, etc.? Can a function get its parameters, locals,
        return types, etc.?
      + Can a function find out who called it?
      + Can a variable be read or set via its name (a string) only?
        Can a function or method be invoked by its name (a string)
        only? Can new variables or types or functions be created at
        run time, given only their name and some indication of how
        they should look?
      + Does the language have a macro system? If so, is it
        string-based or AST-based?
      + Is it possible to unquote within a macro?
      + Are macros hygienic? What syntactic devices are provided to
        explicitly capture in- scope variables, if any?

    [God creating JavaScript]
    GOD: It uses prototype-based inheritance.
    Angel: Nice.
    GOD: Also it secretly adds semicolons to ur code.
    A: wat-- Neckbeard Hacker (@NeckbeardHacker) August 24, 2016

Theoretical Concerns

Moving on from our particular example, let's look at things from a
more academic perspective and consider real-life programming language
definitions. There seem to be three ways to produce a language
definition:

  * An official document, with a mix of formal notation and informal
    descriptions.
  * An official document, with 100% of the definition specified in a
    formal notation.
  * A reference implementation, namely a compiler, so that the
    language definition is "whatever this program does."

Some language definitions are sanctioned by an official standards
organization (like ISO, IEC, ECMA, ANSI, etc.) while some don't even
care about standardization.
Exercise: Create a bibliography of the official standards for the
major programming languages.

Usually a language is defined by considering its:

Syntax

Structure

Semantics

Pragmatics

Pragmatics

Usage

Syntax

We have a lot of options for defining a syntax:

  * CFG (Context Free Grammar)
  * BNF (Backus-Naur Form)
  * EBNF (Extended Backus-Naur Form)
  * ABNF (Augmented Backus-Naur Form)
  * Syntax Diagrams (a.k.a. Railroad Diagrams)
  * PEG (Parsing Expression Grammar)
  * Ohm Grammar (variation on the classic PEG)

The first five forms are all equivalent. They describe exactly the
class of context-free languages. PEGs capture a different set of
languages, including some context-sensitive languages like $a^nb^nc^
n$.

It turns out general parsing of CSGs is hard or inefficient. So we
normally give a grammar for the context-free parts and leave the
context-sensitive parts, like "variables must be declared before
being used," to the semantics.
Exercise: Find out if there are any known complexity bounds (upper or
lower) for parsing a context sensitive language.

Syntax is usually (but not always) divided into:

  * A microsyntax, specifying how the characters in the source code
    stream are grouped into tokens. The microsyntax deals with
    whitespace, comments, and case sensitivity.
  * A macrosyntax, speficying how the tokens are grouped into phrases
    (such as expressions, statements, declarations, etc.)
  * An abstract syntax, which is a much-simplified restructuring of
    the macrosyntax.

Semantics

A language's semantics is specified by mapping its syntactic forms
(often abstract syntax tree fragments) into their meaning. Common
approaches include:

  * Natural Language (informal)
  * "Compile it and run it" (in which the compiler itself defines the
    language, and thus the compiler correctness problem is
    trivial--the compiler is by definition always correct)
  * Denotational Semantics
  * Operational Semantics
  * Axiomatic Semantics
  * Action Semantics

A hugely important distinction is that between:

  * Static Semantics: which deals with legality rules--things you can
    check before running the code (compile time), and
  * Dynamic Semantics: which deals with the execution behavior;
    things that can only be known at runtime.

Example: You've probably heard the distinction between "static" and
"dynamic" before. Recall that a statically-typed language in which
type checking is done prior to program execution and a dynamically
typed language is one in which type checking is done during program
execution. Most languages do a little of both, but one or the other
usually predominates.

Pragmatics

Pragmatics does not affect the formal specification of programming
languages. However, pragmatic concerns must guide your design of a
programming language, if you want it to be easy to read, easy to
write, and able to be implemented efficiently. Pragmatics
encompasses:

  * Common programming idioms (the right ways and the wrong ways of
    doing things)
  * Programming environments, e.g., IDEs, REPLs, workspaces,
    playgrounds, playpens
  * The standard library or libraries
  * The ecosystem for 3rd party libraries (e.g. NPM for JavaScript,
    Pip for Python, Gems for Ruby, Rocks for Lua, Maven for Java,
    ...)