[HN Gopher] Design Guidelines for Domain Specific Languages (2014)
       ___________________________________________________________________
        
       Design Guidelines for Domain Specific Languages (2014)
        
       Author : lr0
       Score  : 123 points
       Date   : 2023-11-09 02:37 UTC (20 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | valenterry wrote:
       | Unfortunately, they don't mention the easiest way of defining a
       | DSL: by doing it in the host language itself. Not all programming
       | languages allow that, but some do, and in that case it is by far
       | the easiest and safest solution.
       | 
       | It is still much, much easier than combining existing languages,
       | such as adding OCL to another language, as suggested in the paper;
       | and you get compiler/IDE support, tooling support and everything
       | else from the "host language" out of the box.
        
         | MrJohz wrote:
         | I'd go a step further, and suggest that most languages are able
         | to embed some level of DSL, although you might not be able to
         | reach quite the expressiveness that you had in mind.
         | 
         | Generally a combination of block closures and being able to
         | chain methods together is usually enough to create something
         | reasonably expressive, although even then there are
         | alternatives - I've seen DSLs in Python that use the class
         | syntax and metaclasses to define quite complex things. You won't
         | be generating your own syntax or anything, but you can often
         | get pretty far.
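
A minimal sketch of the closures-plus-method-chaining style described
above, in Python. The Query class and its where/select/run methods are
hypothetical names invented for illustration, not an existing library.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Query:
    """A tiny embedded query DSL built from chained method calls."""
    source: str
    predicates: List[Callable[[dict], bool]] = field(default_factory=list)
    columns: List[str] = field(default_factory=list)

    def where(self, predicate: Callable[[dict], bool]) -> "Query":
        # Each call returns a new Query, so a chain never mutates earlier steps.
        return Query(self.source, self.predicates + [predicate], self.columns)

    def select(self, *columns: str) -> "Query":
        return Query(self.source, self.predicates, list(columns))

    def run(self, tables: dict) -> List[dict]:
        rows = tables[self.source]
        kept = [r for r in rows if all(p(r) for p in self.predicates)]
        if not self.columns:
            return kept
        return [{c: r[c] for c in self.columns} for r in kept]


tables = {"users": [{"name": "ada", "age": 36}, {"name": "bob", "age": 17}]}
adults = Query("users").where(lambda r: r["age"] >= 18).select("name")
print(adults.run(tables))  # [{'name': 'ada'}]
```

No new syntax is involved: the "language" is only the set of methods
that can be chained, so the host tooling keeps working on it.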
        
         | victorNicollet wrote:
         | This is definitely the best idea for "drop down" DSLs (by drop
         | down, I mean that you are working in a general-purpose language
         | and then, for one specific situation, you drop down into the
         | DSL because it is more productive).
         | 
         | I think the article is more oriented towards standalone DSLs
         | (where the entry point is the DSL itself), as those get actual
         | maintenance benefits from not being able to invoke arbitrary
         | code in the host language.
        
           | valenterry wrote:
           | Yes. My point is just that I would judge that in 90% of the
           | cases, "drop down" DSLs are the best solution for the problem
           | at hand. So when talking about DSLs they should be mentioned,
           | or at least the term DSL should be defined accordingly, which
           | did not happen in the paper.
        
         | chriswarbo wrote:
         | Alan Kay describes OOP as defining interactions with/between
         | values via mini languages/algebras. In a language like
         | Smalltalk (using whitespace and mixfix syntax) this very much
         | feels like crafting a lightweight/hosted DSL. Other languages
         | require a lot more squinting, to get past the host language's
         | punctuation and ceremony.
         | 
         | For example, compare something like the following:
         | canvas drawA: circle at: topLeft in: red.
         | canvas.draw(circle, at=topLeft, in=red);
        
           | ljm wrote:
           | I think the discoverability of Smalltalk code within the IDE
           | lends itself nicely to that too. It'd be quite obscure to many
           | these days, but there's something to be said for the ability to
           | browse and experiment with a catalogue of classes like a
           | graphical REPL on steroids.
        
         | Too wrote:
         | Doesn't necessarily have to be the host language. Just _any_
         | other existing language.
         | 
         | If it's static and declarative, just use yaml. (Yes it sucks,
         | but you get the point)
         | 
         | If it's imperative, just use the host language.
         | 
         | Then you have some hybrid solutions. HCL allowing rich
         | expression in an otherwise very declarative language. A big
         | list of constant dataclasses in python is another quick way to
         | define complex data structures without having to reinvent
         | anything. This even gives you type safety for free.
         | 
         | Heck, MongoDB managed to build a big business on a database whose
         | only query language is super verbose json. This isn't purely
         | negative: by leveraging an existing language, a lot of the
         | surrounding tooling is ready to plug in on day one.
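
A rough sketch of the "big list of constant dataclasses" idea in
Python. The Route type and its fields are made up for illustration.
Note that dataclasses do not enforce annotations at runtime, so the
type safety mentioned above assumes an external checker such as mypy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    path: str
    handler: str
    methods: Tuple[str, ...] = ("GET",)


# The "DSL program" is just a list of constants in the host language.
ROUTES = [
    Route("/", "index"),
    Route("/login", "login", methods=("GET", "POST")),
]


def handlers_for(method: str) -> List[str]:
    """Interpret the declarative data: which handlers accept this method?"""
    return [r.handler for r in ROUTES if method in r.methods]


print(handlers_for("POST"))  # ['login']
```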
        
       | victorNicollet wrote:
       | A few strong opinions, a lot stronger than they were in my 2017
       | talk https://youtu.be/_gmMJwjg0Pk, surely that must be the
       | experience speaking :)
       | 
       | - Creating a DSL is a commitment. It is never done, it's either
       | in development or abandoned. Plan for this, both in terms of team
       | size and knowledge management.
       | 
       | - Don't allow escape hatches or ways to access arbitrary
       | functionality in the host language. When (not if) these are used,
       | they ruin backwards compatibility and make maintenance of DSL
       | code a nightmare.
       | 
       | - Don't use a PEG (parsing expression grammar), they are terrible
       | for backwards compatibility. Hand-written parsers are even worse.
       | Using an LR parser generator is worth the time investment.
       | 
       | - Support static analysis, or at the very least don't make it
       | impossible to implement later. It improves productivity, and
       | every error message is an opportunity to teach the DSL to the
       | users.
       | 
       | - You need a language server. The bare minimum is: highlighting,
       | auto-completion, hints on hover, find references, rename.
       | 
       | - It should be trivial to reproduce a production run on a
       | development machine. This means the language should be
       | deterministic, and the production context should be preserved for
       | later reuse, and the development machine should be safe for
       | accessing production data.
       | 
       | - Thousands of unit tests, thousands of regression tests.
       | 
       | It's fine to break these rules, if you can live with the
       | consequences.
       | 
       | And here's a brand new one from this year: use RAG to have
       | ChatGPT "read" your documentation, and ask it to produce scripts
       | for you. When the script doesn't compile, feed the error message
       | back to ChatGPT and try again until you get a working script.
       | Think deeply about why ChatGPT made some mistakes, as it's often
       | the case that humans would have the same misunderstandings.
       | Improve your documentation, error messages or syntax to eliminate
       | them, and then try again to see if it works. We have been doing
       | this for 6 months now, and there have already been some
       | significant improvements that are only obvious in hindsight. And
       | we've now opened a PhD position to investigate whether we can
       | have a separate "LLM-friendly" syntax for the language, dedicated
       | to making code generation easier.
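
The generate/compile/retry loop described above could look roughly
like this in Python. Both ask_model and compile_script are
hypothetical hooks (the first would wrap whatever LLM/RAG setup is in
use, the second the DSL compiler); the stubs at the bottom exist only
so the sketch runs without any external service.

```python
from typing import Callable, List, Optional, Tuple


def generate_until_compiles(
    task: str,
    ask_model: Callable[[str], str],
    compile_script: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Ask for a script, feeding compiler errors back until it compiles.

    compile_script returns None on success or an error message on failure.
    """
    prompt = task
    errors: List[str] = []          # keep every error for later review
    for _ in range(max_attempts):
        script = ask_model(prompt)
        error = compile_script(script)
        if error is None:
            return script, errors
        errors.append(error)
        prompt = (f"{task}\n\nPrevious attempt:\n{script}"
                  f"\n\nCompiler error:\n{error}")
    return None, errors


# Stand-ins for the model and the compiler (assumptions, not real APIs).
attempts = iter(["show tabel T", "show table T"])
fake_model = lambda prompt: next(attempts)
fake_compiler = lambda s: None if "table" in s else "unknown keyword 'tabel'"

script, errors = generate_until_compiles("Show table T", fake_model, fake_compiler)
print(script)  # show table T
print(errors)  # ["unknown keyword 'tabel'"]
```

The collected errors are the interesting output: they are the list of
misunderstandings to review when improving documentation, error
messages or syntax.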
        
         | avindroth wrote:
         | Hey Victor, these are amazing tips. I am currently building a
         | DSL for LLMs, and would love your feedback/consulting. I could
         | not find your email online.
         | 
         | Please feel free to reply to this or reach me here:
         | @eating_entropy on X joshcho@stanford.edu
         | 
         | Would love advisement from someone who has thought about DSLs
         | extensively!
        
         | danielvaughn wrote:
         | I'm in the midst of creating a DSL (it's my first try) and this
         | sounds like really good advice. I'm not clear on the third item
         | though - by LR parser generator are you talking about something
         | like ANTLR?
        
           | victorNicollet wrote:
           | Yes, something like ANTLR for Java, or the entire list here: 
           | https://en.wikipedia.org/wiki/Comparison_of_parser_generator.
           | ..
           | 
           | I've had good experiences with Menhir (for OCaml) and Tree-
           | sitter, and implemented my own SLR parser generator for C#
           | https://github.com/Lokad/Parsing
           | 
           | In the end, what matters is that they should be able to
           | report conflicts and ambiguities.
        
             | _a_a_a_ wrote:
             | Are you saying PEGs can't have conflicts and ambiguities? I
             | didn't know that.
        
               | victorNicollet wrote:
               | Not having ambiguities is actually the main selling point
               | of PEGs. If you have two rules A and B that can both
               | match the input, then a CFG A|B has an ambiguity (two
               | possible derivations), but a PEG A/B explicitly says that
               | the A derivation is chosen. The good part is that unlike
               | a CFG, the PEG doesn't require you to go and fix anything
               | (the / operator already did that for you). This makes the
               | initial implementation of the grammar easier.
               | 
               | On the other hand, suppose you already have code in the wild
               | that uses the old grammar G1 and, in order to add new
               | features, you introduce a new grammar G2 that is a
               | superset of G1. You need to know if any of the existing
               | code has a derivation in G2 that is different from its
               | derivation in G1 (as that would cause backwards
               | incompatibility). With a PEG, there's no way to tell, so
               | you have to check this by hand (and mistakes are easy).
               | With a CFG, you know that backwards incompatibility
               | happens if and only if G2 has conflicts, and those
               | conflicts are precisely the cases that are not backwards
               | compatible.
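
A small Python illustration of the G1/G2 situation, using a
hand-rolled ordered-choice combinator rather than a real PEG library.
The token and ordered_choice helpers and the two toy grammars are
assumptions made up for this sketch.

```python
import re
from typing import Callable, Optional, Tuple

# A parser takes (text, pos) and returns (label, new_pos), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[str, int]]]


def token(label: str, pattern: str) -> Parser:
    regex = re.compile(pattern)

    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        m = regex.match(text, pos)
        return (label, m.end()) if m else None

    return parse


def ordered_choice(*alternatives: Parser) -> Parser:
    # PEG "/" operator: the first alternative that matches wins, silently.
    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result
        return None

    return parse


identifier = token("identifier", r"[a-z]+")
keyword_print = token("print-statement", r"print\b")

g1 = identifier                                  # old grammar
g2 = ordered_choice(keyword_print, identifier)   # new grammar, accepts all of G1

old_code = "print"
print(g1(old_code, 0))  # ('identifier', 5)
print(g2(old_code, 0))  # ('print-statement', 5)
```

G2 still accepts the old input, but with a different derivation, and
nothing in the ordered-choice machinery reports the change.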
        
               | _a_a_a_ wrote:
               | Excellent answer, thanks
        
               | ulrikrasmussen wrote:
               | A PEG is not actually a generative grammar but a domain-
               | specific language for specifying top-down parsers. So
               | they are free of conflicts and ambiguities by definition
               | of their semantics.
               | 
               | PEG is actually just syntactic sugar on top of (G)TDPL:
               | https://en.wikipedia.org/wiki/Top-down_parsing_language
        
           | valenterry wrote:
           | What host language are you using?
        
             | danielvaughn wrote:
             | To be perfectly honest, I'm not sure what that means. I'm
             | new to language design. I can describe what I'm trying to
             | do.
             | 
             | I'm creating a UI description language for designers. The
             | goal is to let them use terminology and mental models
             | familiar to them to describe their designs in code, instead
             | of a visual tool like Figma.
             | 
             | A trivial example would look something like this:
             | component Button {
             |   elements
             |     shape btn-container
             |     text btn-label
             | 
             |   style btn-container
             |     fill-color: blue
             | 
             |   style btn-label
             |     content: Click Me
             |     text-color: white
             | }
             | 
             | The output of the language would be an intermediate
             | representation, I'm imagining a JSON object or something
             | with a very specific schema. This can then be transpiled
             | into any format the developer wants - you could build a
             | transpilation target for React, Vue, Svelte, plain old
             | html/css, etc etc.
             | 
             | So I'm in a weird spot where I know what I want to make,
             | but I don't know any of the conventions or tools common in
             | the language design world, because I'm just stepping into
             | it.
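
One way the intermediate-representation step could look, sketched in
Python. The shape of BUTTON_IR, the CSS_NAMES mapping and the choice of
inline-styled HTML as a target are all assumptions for illustration;
the comment above does not specify a schema.

```python
import html

# Hypothetical intermediate representation for the Button example.
BUTTON_IR = {
    "component": "Button",
    "elements": [
        {"kind": "shape", "id": "btn-container"},
        {"kind": "text", "id": "btn-label"},
    ],
    "styles": {
        "btn-container": {"fill-color": "blue"},
        "btn-label": {"content": "Click Me", "text-color": "white"},
    },
}

# Mapping from designer-facing property names to CSS (an assumption).
CSS_NAMES = {"fill-color": "background-color", "text-color": "color"}


def to_html(ir: dict) -> str:
    """One possible transpilation target: plain HTML with inline styles."""
    parts = []
    for element in ir["elements"]:
        style = dict(ir["styles"].get(element["id"], {}))
        content = html.escape(style.pop("content", ""))
        css = "; ".join(f"{CSS_NAMES.get(k, k)}: {v}" for k, v in style.items())
        tag = "span" if element["kind"] == "text" else "div"
        parts.append(f'<{tag} id="{element["id"]}" style="{css}">{content}</{tag}>')
    return "\n".join(parts)


print(to_html(BUTTON_IR))
```

A React or Svelte backend would be another function over the same
dictionary, which is the appeal of fixing the IR schema first.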
        
               | victorNicollet wrote:
               | The host language is the language that you are using to
               | write the compiler/interpreter. I suppose the grandparent
               | is asking the question in order to recommend a parser
               | generator for your use case. From your GitHub profile, I
               | assume it's JS or TypeScript ?
        
               | danielvaughn wrote:
               | Ah I see, thanks. Probably Typescript, although I'm
               | considering Rust as well. I'm using tree-sitter for the
               | lexer.
        
           | no_wizard wrote:
           | you might want to look at Rascal[0]
           | 
           | [0]: https://www.rascal-mpl.org/
        
             | danielvaughn wrote:
             | Nice, this looks interesting. I was building my lexer in
             | tree-sitter but not sure what to use after that. I'll check
             | this out, thank you.
        
         | FrustratedMonky wrote:
         | Is an LR parser better than a parser combinator? It seems like
         | there are pluses to rolling your own parser combinator over
         | spending the time to integrate with yacc or another library.
        
           | victorNicollet wrote:
           | I am a bit biased (in the sense that I wrote my own SLR
           | parser generator twice, in C# and F#!), but the benefit of LR
           | is that you can know whether your CFG contains conflicts,
           | which is a game changer for preserving backwards
           | compatibility when you change the syntax later during the
           | life of the DSL.
           | 
           | But a parser combinator library may support converting the
           | resulting combined parser into an EBNF grammar, and checking
           | whether that grammar contains conflicts.
        
             | marcosdumay wrote:
             | Hum... A grammar conflict in parser combinators always
             | requires that you step down from the parser abstraction and
             | resolve it by hand with basic language functionality,
             | doesn't it? It's something quite hard to notice.
             | 
             | Are you concerned with the documentation getting out of
             | sync with the actual language?
        
               | victorNicollet wrote:
               | In a situation where I'm adding a new feature to the
               | language, with a new syntax that causes me to add a new
               | rule to the grammar, I'm concerned that the new rule will
               | accidentally "capture" some code that already exists in
               | the wild, that was previously derived by another part of
               | the grammar.
               | 
               | To give a very ugly example, if you have a language with
               | function calls f(expr, expr, expr) and you want to add
               | tuple syntax to expressions with a brand new rule:
               | expr := expr COMMA expr
               | 
               | Then you might have accidentally turned all functions
               | into unary functions, as the tuple rule captures the
               | "expr, expr, expr" part and leaves you with f(expr).
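
To make the capture concrete, here is a brute-force Python sketch that
counts how many parse trees the argument list "a, b, c" has with and
without the new tuple rule. The grammar in the docstring is a guess at
what the comment has in mind. An LR generator would report conflicts
instead of counting trees; the counting is only for illustration.

```python
from functools import lru_cache


def count_derivations(tokens, allow_tuples):
    """Count parse trees for `args` over the whole token list.

    Grammar (hypothetical, mirroring the comment above):
        args := expr | expr COMMA args
        expr := ID   | expr COMMA expr    (second rule only if allow_tuples)
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def expr(i, j):
        total = 0
        if j - i == 1 and tokens[i] != ",":
            total += 1                     # a single identifier
        if allow_tuples:
            for k in range(i + 1, j - 1):  # split at every inner comma
                if tokens[k] == ",":
                    total += expr(i, k) * expr(k + 1, j)
        return total

    @lru_cache(maxsize=None)
    def args(i, j):
        total = expr(i, j)                 # the whole span as one argument
        for k in range(i + 1, j - 1):
            if tokens[k] == ",":
                total += expr(i, k) * args(k + 1, j)
        return total

    return args(0, len(tokens))


call_args = ["a", ",", "b", ",", "c"]
print(count_derivations(call_args, allow_tuples=False))  # 1
print(count_derivations(call_args, allow_tuples=True))   # 5
```

With the tuple rule the same tokens can be read as one, two or three
arguments, which is exactly the ambiguity a conflict report would flag.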
        
             | FrustratedMonky wrote:
             | I do love F# for parsing and for DSLs. Maybe it is
             | 'functional' programming bias.
             | 
             | If you are using a 'functional' paradigm, you can do your
             | own parser/combinator.
             | 
             | But if you are using C#, do you use something like Yacc and
             | LR?
        
         | trealira wrote:
         | > Hand-written parsers are even worse. Using an LR parser
         | generator is worth the time investment.
         | 
         | I'm curious, what makes you say this? From what I've read
         | online secondhand, people tend to recommend handwritten
         | recursive descent parsing and Pratt/TDOP parsing, because it's
         | easier to add good error messages and later add context
         | sensitive productions (e.g. the way Clang resolves class-wide
         | declarations while parsing C++:
         | https://eli.thegreenplace.net/2012/07/05/how-clang-
         | handles-t...).
        
           | mjul wrote:
           | He explains it in the linked video presentation which is
           | worth watching: the value being that your parser generator
           | will complain about any ambiguity in the grammar (shift-
           | reduce conflicts) rather than having that show up as a "wat?"
           | at runtime.
           | 
           | You might argue that this argument reduces to checking the
           | grammar using static analysis similar to the parser generator.
           | But then handwriting the parser is still extra work and risk.
        
             | civilitty wrote:
             | What I've seen most nontrivial projects do (and have done
             | in a commercial product myself) is to start with a parser
             | generator then move to hand written parsers when UX and
             | polish become a bigger priority than exploring the problem
             | space.
             | 
             | The parser generator grammar spec can then be used to
             | generate random test cases and compare the output of the
             | new parser with the old one to make sure they're identical.
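
A sketch of that differential-testing idea in Python. The toy GRAMMAR
and both parsers are stand-ins invented for this example, and they
return evaluated values rather than parse trees to keep it short.

```python
import random

# Hypothetical toy grammar: nonterminals map to lists of alternatives.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"]],
    "term": [["NUM"], ["(", "expr", ")"]],
}


def generate(symbol, rng, depth=0):
    """Randomly expand `symbol` into a list of tokens."""
    if symbol == "NUM":
        return [str(rng.randint(0, 9))]
    if symbol not in GRAMMAR:
        return [symbol]                      # literal terminal
    alternatives = GRAMMAR[symbol]
    # Past a depth limit, always take the first (non-recursive) alternative.
    alt = alternatives[0] if depth > 4 else rng.choice(alternatives)
    return [tok for part in alt for tok in generate(part, rng, depth + 1)]


def parse_old(tokens):
    """Stand-in for the generated parser: recursive descent."""
    pos = 0

    def expr():
        nonlocal pos
        value = term()
        if pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            value += expr()
        return value

    def term():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            value = expr()
            pos += 1                         # skip ")"
            return value
        value = int(tokens[pos])
        pos += 1
        return value

    return expr()


def parse_new(tokens):
    """Stand-in for the hand-written replacement: explicit stack."""
    stack = [0]
    for tok in tokens:
        if tok == "(":
            stack.append(0)
        elif tok == ")":
            inner = stack.pop()
            stack[-1] += inner
        elif tok != "+":
            stack[-1] += int(tok)
    return stack[0]


def differential_test(runs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(runs):
        tokens = generate("expr", rng)
        assert parse_old(tokens) == parse_new(tokens), " ".join(tokens)
    return runs


print(differential_test())  # 200
```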
        
         | bloopernova wrote:
         | Good lord yes on the language server. Reduce the friction
         | involved with writing code.
         | 
         | I'm frustrated with terraform-ls right now. It doesn't work
         | with private module registries, reloading each module in the
         | .terraform dir over and over, which hurts performance by
         | maxing out a CPU core.
         | 
         | And yet a feature of Terraform Cloud (TFC) is that you can use
         | private registries.
         | 
         | Writing Terraform is already plagued by friction. Adding TFC
         | adds to the friction. The language server then not loading
         | private modules adds even more.
         | 
         | Hopefully OpenTofu helps in that regard, but I don't think
         | they've got anyone working on the language server either.
        
           | cube2222 wrote:
           | > but I don't think they've got anyone working on the
           | language server either.
           | 
           | Not yet, but it's on our radar. We're focusing on the stable
           | release + registry for now (eta mid-December), but we're
           | planning to work on the language server eventually.
           | 
           | If you have any more frustrations with the language server,
           | feel free to respond here, and it'll help us find avenues for
           | improvements.
           | 
           | Disclaimer: Interim Tech Lead of OpenTofu
        
         | starcraft2wol wrote:
         | This is ridiculously impractical. For Lisp programmers, making
         | DSLs is just how you write programs, and you don't need a
         | parser.
        
         | Too wrote:
         | One more from me
         | 
         | - Any expression that is likely to include a user-generated
         | variable should have a strategy for injecting it safely, to
         | avoid the SQL-injection class of attacks. Native templating is
         | the best. If string templating is the only way out, the rules
         | for escaping should be clearly defined, and ideally functions
         | for doing so should be provided by the reference implementation.
         | 
         | Not like the geniuses at Atlassian who came up with JQL and
         | refuse to document how it works, instead delegating all security
         | to the user model and "don't run queries with any data that you
         | didn't provide yourself".
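
The difference between string templating and native parameter binding,
shown with SQL through Python's standard sqlite3 module as a stand-in
for any query DSL. The table and the hostile input are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO issues VALUES (?, ?)",
    [(1, "crash on start"), (2, "typo in docs")],
)

user_input = "x' OR '1'='1"

# Unsafe: string templating lets the input rewrite the query itself.
unsafe_sql = f"SELECT id FROM issues WHERE title = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the placeholder keeps the input as a plain value.
safe_rows = conn.execute(
    "SELECT id FROM issues WHERE title = ?", (user_input,)
).fetchall()

print(unsafe_rows)  # [(1,), (2,)]
print(safe_rows)    # []
```

A DSL that offers something like the "?" placeholder spares its users
from having to learn and apply escaping rules by hand.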
        
       | mejutoco wrote:
       | I first saw this talk of DSLs with Ruby, where it is/was popular
       | to use method_missing and others to build your own "language".
       | 
       | I do not mean it in a negative way, but at the time it seemed
       | like calling a utility class DSL was excessive. It just seemed
       | like regular programming, using the language constructs. It also
       | was not enforcing many boundaries (there are escape hatches).
       | 
       | Looking at all of this, would it not be easier to encode the DSL
       | in something like json, and turn it into a data/format problem?
       | Something like terraform. The implementation can then be in any
       | language that can read json. One could imagine, for example,
       | regex defined in json as a DSL.
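
As a sketch of the "regex defined in json" thought experiment, the
Python below compiles a small JSON tree into a regular expression. The
node types (literal, digit, sequence, choice, repeat) are a
hypothetical schema invented for this example.

```python
import json
import re


def compile_node(node):
    """Translate one JSON node (hypothetical schema) into regex source."""
    kind = node["type"]
    if kind == "literal":
        return re.escape(node["value"])
    if kind == "digit":
        return r"\d"
    if kind == "sequence":
        return "".join(compile_node(child) for child in node["items"])
    if kind == "choice":
        return "(?:" + "|".join(compile_node(c) for c in node["options"]) + ")"
    if kind == "repeat":
        inner = compile_node(node["item"])
        return "(?:%s){%d,%d}" % (inner, node["min"], node["max"])
    raise ValueError("unknown node type: %r" % kind)


SOURCE = """
{"type": "sequence", "items": [
  {"type": "choice", "options": [
    {"type": "literal", "value": "v"},
    {"type": "literal", "value": "V"}]},
  {"type": "repeat", "min": 1, "max": 3, "item": {"type": "digit"}},
  {"type": "literal", "value": "."},
  {"type": "repeat", "min": 1, "max": 3, "item": {"type": "digit"}}
]}
"""

pattern = re.compile(compile_node(json.loads(SOURCE)))
print(bool(pattern.fullmatch("v1.20")))  # True
print(bool(pattern.fullmatch("v1x20")))  # False
```

Any language with a JSON reader could host the same interpreter, at
the cost of a much more verbose surface syntax than a regex literal.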
        
         | victorNicollet wrote:
         | I would say JSON + schema, rather than just JSON. Like
         | S-expressions, JSON saves you from having to implement a
         | tokenizer, and a tokenizer is one of the easier parts of a DSL.
         | You still need to implement something like a syntax, because
         | things that are handled by the syntax now get pushed into later
         | stages of the DSL. However, if you can provide a JSON schema
         | for your language, then the problem is mostly solved !
         | 
         | On the other hand, it's not enough to read JSON, you need a
         | JSON parser library that provides you with the position
         | (line:column) of every JSON value, so that you can emit user-
         | friendly errors.
        
         | chriswarbo wrote:
         | I think _every_ language (DSL or not) should support
         | translation to/from a simple data representation for its
         | syntax trees (whether s-expressions, JSON, XML, etc.). Not
         | necessarily _the same_ representation, but just something
         | structured that existing tools can walk and transform, such
         | that (a) we don't resort to grepping through bytes (and
         | manually filtering out matches that are comments or strings)
         | and (b) adding a keyword to a language won't break all existing
         | tooling that's unable to parse it.
         | 
         | I wrote about this at
         | http://www.chriswarbo.net/blog/2017-01-31-syntax_trees.html
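
Python happens to expose its syntax trees through the standard ast
module, which makes for a quick illustration of point (a). The to_data
and names_used helpers are names invented for this sketch.

```python
import ast


def to_data(node):
    """Convert a Python syntax tree into plain dicts, lists and scalars."""
    if isinstance(node, ast.AST):
        data = {"type": type(node).__name__}
        for name, value in ast.iter_fields(node):
            data[name] = to_data(value)
        return data
    if isinstance(node, list):
        return [to_data(item) for item in node]
    return node  # str, int, None, ...


def names_used(data):
    """Walk the plain-data tree and collect every referenced identifier."""
    found = []
    if isinstance(data, dict):
        if data.get("type") == "Name":
            found.append(data["id"])
        for value in data.values():
            found.extend(names_used(value))
    elif isinstance(data, list):
        for item in data:
            found.extend(names_used(item))
    return found


source = 'total = price * 2  # price is in cents\nlabel = "price"\n'
tree = to_data(ast.parse(source))
print(names_used(tree))       # ['total', 'price', 'label']
print(source.count("price"))  # 3
```

The text search finds "price" three times (code, comment and string),
while the tree walk reports the single real reference.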
        
       | LelouBil wrote:
       | I built a DSL for a Visual Novel game I am making, by using ANTLR
       | for parsing and then executing it using C#.
       | 
       | Are there generic frameworks for implementing static analysis
       | out there?
       | 
       | I did a rudimentary version of it with some simple checks, but I
       | have trouble wrapping my head around checking conditional
       | branches and stuff like this.
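
For the conditional-branch part, one common approach is to analyse
each branch separately and keep only the facts that hold on every
path. Below is a minimal Python sketch of a "used before set" check
built that way; the tuple-based statement format is made up and is not
tied to ANTLR or any framework.

```python
def check(stmts, defined):
    """Return (variables surely set afterwards, list of error messages).

    Statements are hypothetical tuples:
        ("set", name)                    assign a variable
        ("use", name)                    read a variable
        ("if", then_stmts, else_stmts)   two-way branch
    """
    defined = set(defined)
    errors = []
    for stmt in stmts:
        kind = stmt[0]
        if kind == "set":
            defined.add(stmt[1])
        elif kind == "use":
            if stmt[1] not in defined:
                errors.append(f"'{stmt[1]}' may be used before it is set")
        elif kind == "if":
            then_defined, then_errors = check(stmt[1], defined)
            else_defined, else_errors = check(stmt[2], defined)
            errors += then_errors + else_errors
            # Only variables set on *both* paths are safe after the branch.
            defined = then_defined & else_defined
    return defined, errors


script = [
    ("set", "name"),
    ("if", [("set", "mood"), ("set", "gift")], [("set", "mood")]),
    ("use", "mood"),
    ("use", "gift"),
]
_, problems = check(script, set())
print(problems)  # ["'gift' may be used before it is set"]
```

The same recursive shape (analyse each branch, then merge with an
intersection) carries over to other checks, such as types or reachable
scenes.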
        
         | no_wizard wrote:
         | check out Rascal[0]
         | 
         | [0]: https://www.rascal-mpl.org/
        
           | LelouBil wrote:
           | Looks interesting but I think I lack a lot of knowledge to be
           | able to use it right now..
        
       | scottfr wrote:
       | I've created a few DSLs and I've regretted not closely emulating
       | an existing language that users already know.
       | 
       | When creating a new DSL you should find the language that is most
       | familiar to your users. That could be Python, Javascript, or SQL.
       | There is a good chance though that it is spreadsheet formulas.
       | 
       | Whatever that language is, align the capabilities you add to your
       | DSL with its syntax and function naming, diverging only where
       | needed to add the custom capabilities that motivated the DSL.
       | 
       | The closer you align with what your users already know, the more
       | easily they will be able to adopt it. An added benefit that
       | emerged in the past year is that GPTs will be able to write your
       | DSL with less prompting.
        
       | swader999 wrote:
       | I want to build a DSL for a really specific reporting use case.
       | It's daunting though. Not sure if what I have in mind is a true
       | DSL, perhaps just a souped up builder pattern.
        
       ___________________________________________________________________
       (page generated 2023-11-09 23:02 UTC)