https://cs.lmu.edu/~ray/notes/languagedesignnotes/ Language Design So, you want to design your own language? Of course you do. Or perhaps you are taking a class and are being forced to create a programming language under penalty of a bad grade. What kinds of things do you need to know? Prerequisites It helps to be experienced. But it's okay if you're not--you can get lucky, too! You should have a good sense of: * Programming Paradigms: imperative declarative structured object-oriented functional applicative concatenative logic protocol-oriented aspect-oriented array event-driven dataflow agent-based etc. * Programming Language Concepts: Sequencing, conditional execution, iteration, recursion, functional decomposition, modularity, synchronization, metaprogramming, binding, scope, extent, volatility, subtyping, pattern matching, type inference, closures, prototypes, introspection, instrumentation, annotations, decorators, memoization, traits, streams, monads, actors, mailboxes, comprehensions, continuations, wildcards, promises, regular expressions, proxies, transactional memory, inheritance, polymorphism, parameter modes, etc. Study Existing Languages It's nice to have a feel for a variety of languages. Here are a few that are good to be familiar with (and the reasons why they are good to study): * Python, for basic imperative programming and scripting * Smalltalk, for OOP done beautifully, and Ruby for copying it, sort of, ... * JavaScript, for event-driven features, async, and promises * CoffeeScript, for ways to make syntax concise * Io, for ideas on building up everything from a minimal semantics * Julia, for multiple dispatch * Clojure, Racket, Scheme, and Common Lisp, for Lispiness (macros, etc.) * Standard ML, OCaml, F#, and Elm, for Hindley-Milner typing and more * Haskell, for typeclasses and functional purity * eToys, Scratch, Snap!, and GP for the blocks approach * Erlang and Elixir, for expressing concurrent processes * Java and C#, for being enterprisey * Kotlin, Ceylon, and Scala, as examples of evolving Java * PureScript, TypeScript, and ClojureScript, as examples of evolving JavaScript * C++ and Rust, for pointers and other system constructs * Go and Swift, for more examples of modern, but generally mainline, ideas * J and K, for array programming * Idris, for dependent typing * Prolog and Mercury, for logic programming * Factor, for concatenative programming * Quipper, because it's a language for quantum computing * Brainfuck, Malbolge, LOLCODE, Whitespace, and other classic esoterics * GolfScript, CJam, Pyth, Jelly, and other golfing languages, because this gives you a good sense of putting power into small syntactic constructs * Piet and Hexagony, for, well, check them out.... Also check out this mini-encyclopedia of 70 languages. Here are some excellent cross-language comparisons that help you to hone your understanding of how different syntaxes can express the same idea: * Syntax Across Languages * Wikipedia Programming Language Comparisons * Hyperpolyglot These are really good too: * Showcase of Languages * Languages Through the Years Don't Forget The Literature Remember, many people have designed languages before you. They made mistakes. They came up with brilliant ideas. Many were wildly successful. Some never made it big. Some people have brought in years of research on how people think and learn to come up with principles for language (and environment) design. You should learn form their experiences. Study classic papers. Read web essays. Here is a small sampling: * Hints on Programming Language Design by Tony Hoare * Learnable Programming by Bret Victor * Five Questions about Language Design by Paul Graham * For early classic papers, check out the massive bibliography of this paper * Also look at the papers from Kathleen Fisher's Programming Langauge Design class Think about the future: Sketch and Prototype Ready to strike out on your own? Do the following, in no particular order: * Understand the audience that the language is designed for, and what kinds of things they want to create with it, or problems they want to solve with it. * Sketch programs or program fragments in your language. * Sketch fragments that are not in your language. * Come up with a list of semantic capabilities, or features. Make sure they enable programmers to express their creations by following the suggestions and principle in the Learnable Programming essay, including: + Make meaning transparent + Explain in context + Make flow and time tangible and visible + Show the data, and avoid hidden state + Use metaphor + Allow easy decomposition and recomposition + Allow the programmer to write readable code--you do support mentioning parameter names in calls, right? * Come up with a nice, consistent, syntactic theme. Then: * What will you call your language, and why? Remember that design is an iterative process, and creativity is enabled and enhanced with immediate feedback. So you should design with tools that allow you to experiment and test your ideas. You should use the Ohm Editor. Use the Ohm Editor A great tool for experimenting with programming languages is the Ohm Editor. ohm-editor.png In the upper left panel, design your grammar. You can load/save from your browser's local storage, and even publish gists to GitHub. In the upper right panel, enter test cases: both tests you want to succeed (thumbs up) and those you want to fail (thumbs down). The bottom panel is an interactive concrete syntax tree for the currently selected test case. This tool will save you a lot of time. Things To Think About Okay I'm just going to rattle off a bunch of this that come into my head that you may or may not want to think about when you design your language: * Are you trying to make a reasonable, usable language or an esoteric/joke/golfing/weirdo language? * Do you have a specific audience in mind? Artists? Graphic designers? AI researchers? Numeric and Scientific nerds? Natural language types? Game developers? Machine learning people? Animators? High performance folks? System programmers? Or do you want a general purpose language? Or do you just want to do what you want? * Do you want to be pragmatic, idealistic, researchy, or evil? * Do you want your language firmly in one camp--OO, functional, logic, concatenative, plain-imperative--or be a multiparadigm symphony? Or a multiparadigm cacophany? * Do you want a language built on a single characteristic building block (a crystallization of style) or one with a huge variety of syntactic forms for multiple semantic aspects (an agglutination of features)? * What is your scheme for showing structure? Curly-brace (JavaScript, Java, C++, C#), Terminal-end (Ruby, Ada), Nested parentheses (Lisp, Clojure, Racket), Indentation (Python, CoffeeScript), Blocks (EToys, Scratch, Snap!), Pictures (Piet), Other (Haskell, Erlang, Prolog)? * If your language supports functions: + Must you pass the exact number of arguments the function expects? + Can you specify parameter names in the call? + Do you have rest parameters? optional parameters? keyword parameters? default parameters? + Are the parameters typed? + Can parameters be modified? Are they passed by value, value-result, reference, name, etc? + Are arguments evaluated L-R, R-L, in parallel, or in arbitrary order? + How does a function return a value? Can it return zero or more values or just one? + Are functions first-class? If so, do you use deep binding or shallow binding? + Is recursion allowed? + Can you test functions for equality? + Can functions be anonymous? * What is your type system like? + Static vs. dynamic, strong vs. weak, manifest vs. implicit? + Do you distinguish primitive types from reference types? + Do you have all those different numeric types based on bit size? What about decimals and ratios? + Do you have a separate character type? A separate boolean type? + Can you get the type of an expression at runtime? + Are types objects? + Do you have supertypes and subtypes? Multiple-inheritance? If so, how do you handle name collisions? + Are types the same as classes? Can you add new types? + Do you have both sum and product types? + Do functions have a type? + Are there pointer types? + Are there parameterized types? If so, are they (mostly) covariant, contravariant, or invariant? + Any dependent types? * What do your expressions look like? + And does your runtime follow eager or lazy evaluation? Or both? + Do you have only prefix, only postfix, only infix, or a wild mix of operators? + Can you overload operators? If so, how? + Can you change the precedence of operators? Even the built-in ones? At runtime? + How much type inference do you have? + Can you mark variables as mutable or immutable? + Are your variables bound or assigned? Can they be assigned more than once? + Do you have destructuring and/or pattern matching? + How do you determine scope? Do you implicitly or explicitly import into inner scopes? + Are there keywords that define access like public, private, and protected, or are there conventions, like in Go, where capitalized entities are implicitly exportable and lower-cased entities are private? + Is there a let-expression for super-local scopes? + Is shadowing allowed? + Do you have anything like JavaScript's this expression? * How do you express control flow? + Do you like expression-orientation or do you have real statements? + How do you express sequential flow vs. nondeterministic flow vs. parallel flow? + If you have nondeterminism, how do you ensure fairness? + Must guards in a multiway selection be executed in any particular order? + Do you have short-circuit operators? Iterator objects? Generators? + Do you have all the loops: (1) forever, (2) n times, (3) while/until a condition, (4) through a numeric range (with an optional step size), or (5) through a collection? + Do you have break and continue? Anything like Ruby's retry and redo? + Do you have exceptions? Or do you need Go-style constructs? Or do you have nullables everywhere? + Do you have a timer for sleeping, delaying execution, or running on intervals? + Can I haz goto? * How do you support concurrency? + Threads, or events? + Shared memory only? Message passing only? Or both? If shared memory, is it mutable? If so, do you have locks, higher-level synchronization devices, or some kind of transactional memory? + Do you support different levels of granularity of concurrency? + Are tasks spawned implicitly when their enclosing block begins, or do they require explicit invocation? + Do tasks die when the enclosing block or spawning task terminates? Or does the spawner wait for all internal tasks to terminate? + Do taks know who started them? + If an asynchronous task is launched, how is a result obtained? Callback, promise, or some other mechanism? + Is message passing done via named channels, named processes, or bulletin board? + Is message passing always synchronous, asynchronous, or both? + Is there timed or conditional synchronization mechanisms? + Can tasks detect their own state? * How meta are you? + Can a program get a list of its global variables, loaded classes, top-level functions, etc.? Can a class get a list of its fields, methods, constructors, superclass(es), subclasses, etc.? Can a function get its parameters, locals, return types, etc.? + Can a function find out who called it? + Can a variable be read or set via its name (a string) only? Can a function or method be invoked by its name (a string) only? Can new variables or types or functions be created at run time, given only their name and some indication of how they should look? + Does the language have a macro system? If so, is it string-based or AST-based? + Is it possible to unquote within a macro? + Are macros hygienic? What syntactic devices are provided to explicitly capture in- scope variables, if any? [God creating JavaScript] GOD: It uses prototype-based inheritance. Angel: Nice. GOD: Also it secretly adds semicolons to ur code. A: wat-- Neckbeard Hacker (@NeckbeardHacker) August 24, 2016 Theoretical Concerns Moving on from our particular example, let's look at things from a more academic perspective and consider real-life programming language definitions. There seem to be three ways to produce a language definition: * An official document, with a mix of formal notation and informal descriptions. * An official document, with 100% of the definition specified in a formal notation. * A reference implementation, namely a compiler, so that the language definition is "whatever this program does." Some language definitions are sanctioned by an official standards organization (like ISO, IEC, ECMA, ANSI, etc.) while some don't even care about standardization. Exercise: Create a bibliography of the official standards for the major programming languages. Usually a language is defined by considering its: Syntax Structure Semantics Pragmatics Pragmatics Usage Syntax We have a lot of options for defining a syntax: * CFG (Context Free Grammar) * BNF (Backus-Naur Form) * EBNF (Extended Backus-Naur Form) * ABNF (Augmented Backus-Naur Form) * Syntax Diagrams (a.k.a. Railroad Diagrams) * PEG (Parsing Expression Grammar) * Ohm Grammar (variation on the classic PEG) The first five forms are all equivalent. They describe exactly the class of context-free languages. PEGs capture a different set of languages, including some context-sensitive languages like $a^nb^nc^ n$. It turns out general parsing of CSGs is hard or inefficient. So we normally give a grammar for the context-free parts and leave the context-sensitive parts, like "variables must be declared before being used," to the semantics. Exercise: Find out if there are any known complexity bounds (upper or lower) for parsing a context sensitive language. Syntax is usually (but not always) divided into: * A microsyntax, specifying how the characters in the source code stream are grouped into tokens. The microsyntax deals with whitespace, comments, and case sensitivity. * A macrosyntax, speficying how the tokens are grouped into phrases (such as expressions, statements, declarations, etc.) * An abstract syntax, which is a much-simplified restructuring of the macrosyntax. Semantics A language's semantics is specified by mapping its syntactic forms (often abstract syntax tree fragments) into their meaning. Common approaches include: * Natural Language (informal) * "Compile it and run it" (in which the compiler itself defines the language, and thus the compiler correctness problem is trivial--the compiler is by definition always correct) * Denotational Semantics * Operational Semantics * Axiomatic Semantics * Action Semantics A hugely important distinction is that between: * Static Semantics: which deals with legality rules--things you can check before running the code (compile time), and * Dynamic Semantics: which deals with the execution behavior; things that can only be known at runtime. Example: You've probably heard the distinction between "static" and "dynamic" before. Recall that a statically-typed language in which type checking is done prior to program execution and a dynamically typed language is one in which type checking is done during program execution. Most languages do a little of both, but one or the other usually predominates. Pragmatics Pragmatics does not affect the formal specification of programming languages. However, pragmatic concerns must guide your design of a programming language, if you want it to be easy to read, easy to write, and able to be implemented efficiently. Pragmatics encompasses: * Common programming idioms (the right ways and the wrong ways of doing things) * Programming environments, e.g., IDEs, REPLs, workspaces, playgrounds, playpens * The standard library or libraries * The ecosystem for 3rd party libraries (e.g. NPM for JavaScript, Pip for Python, Gems for Ruby, Rocks for Lua, Maven for Java, ...)