https://skalt.github.io/projects/brzozowski_ts/ * Home * Search * Projects * Posts * * Notes * TIL * Bookmarklets Implementing Regular Expressions in TypeScript Types (Badly) Published on 2024-10-10 skalt/brzozowski-ts * Summary * Backstory * Compile-time RegExp checking is feasible + Brzozowski derivatives * Proof-of-concept * And then I got carried away. * Reflections + Current uses & the path forward + You probably shouldn't use my code + Tricks for programming in TypeScript Generics + A nerd snipe for someone else Summary This is a cautionary tale about how I ended up writing a (bad) regular expression parser and evaluator in pure TypeScript types. type HexStr = Recognize<"[0-9a-fA-F]+", S>; type IsMatch = HexStr extends S ? true : false; const isHex: IsMatch<"deadbeef"> = true const notHex: IsMatch<"nope"> = false The novelty here is parsing and evaluating regular expressions at compile-time. The inspiration for this technique came from a comment from 2020 on the RegEx-validated string type discussion. Backstory To prepare for Halloween, I was writing a function castSpell that accepts a valid hex string. I was writing in TypeScript, so the usual way to do this is using branded types, also known as nominal types: type Hex = string & { readonly isHex: unique symbol; }; export const castSpell = (hex: Hex) => { /* magic */ }; castSpell("asdf" as Hex); // ok! // Wait, "s" isn't a hex character... I could even use type predicates to narrow a general string down to a branded hex string: const isHex = (str: string): str is Hex => /[0-9a-fA-F]/.test(str); const mustBeHex = (str: string): Hex => { if (isHex(str)) return str; else throw new Error( `'${str}' is not a hex string` ); }; [playground link] My options so far for guaranteeing that only valid hex strings are passed to castSpell are: 1. delegate checking valid hex strings to the next programmer + pro: check happens at compile time + con: can be wrong 2. delegate checking valid hex strings to the JS runtime + pro: always right + con: check happens at runtime These two options are good enough for pretty much every practical use-case. But what if I wanted to check the validity of hex strings automatically at compile time? Compile-time RegExp checking is feasible TypeScript's type system is Turing-complete, so parsing and evaluating regular expressions is definitely possible. I've already see things like solving N-Queens, parsing and evaluating SQL, or implementing deterministic finite automata in the wild. Ignoring the question of "should I do this", only the question "how to check if a RegEx matches a string at compile-time" remains. Brzozowski derivatives A Brzozowski derivative of a set of strings S with respect to a string U is the set of strings in S that start with U, with the prefix removed. Here's a Brzozowski derivative in TypeScript: type Derivative< S extends string, U extends string, > = S extends `${U}${infer Rest}` ? Rest : never; type _ = Derivative<"cat" | "cow" | "dog", "c">; // => "at" | "ow" [playground link] Notably, we can check if a string matches a regular expression using Brzozowski derivatives: A string U is a member of the string set denoted by a generalized regular expression R if and only if the empty string is a member of the string set denoted by the derivative of R with respect to U. - Brzozowski (1964), p.483, Theorem 4.2 Proof-of-concept type Err = { error: Msg }; type ErrInfinite = Err<"type 'string' is infinite & cannot be matched">; type Digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"; type LowerHex = "a" | "b" | "c" | "d" | "e" | "f"; type UpperHex = Uppercase; type HexChar = Digit | LowerHex | UpperHex; type Derivative< Str extends string, Prefix extends string, > = Str extends `${Prefix}${infer Rest}` ? Rest : Err<"No match"> & { Str: Str; Prefix: Prefix }; type IsHexStr = string extends S ? Err<"type 'string' is infinite & cannot be matched"> : S extends "" ? true // matched! : Derivative extends infer Result ? Result extends string ? IsHexStr : Result : Err<"unreachable: infallible infer statement">; type RecognizeHex = IsHexStr extends infer Result ? Result extends true ? S : Result // Result must be an Err<_> : Err<"unreachable: infallible infer statement">; const castSpell = (hex: RecognizeHex) => hex; const spell = castSpell("abc123"); // ok! const spellCheck: typeof spell = "abc123"; // ok! //@ts-expect-error castSpell("0xfff"); // => Err<"no match"> & {Str: "xfff"; Prefix: HexChar} [playground link] In the proof-of-concept above, Derivative is shortening the string-to-be-matched but not the regular expression [a-fA-F0-9]+, since the derivative of the RegEx with respect to any valid hex character is the same RegEx. In a real RegEx evaluator, we'd need to keep track of the parts of the RegEx in order to consume them. And then I got carried away. Next thing I knew, I had written a mostly-complete RegExp parser and evaluator. Counting the lines of code with scc found 805 lines of type definitions, which is … a lot. scc also produces COCOMO estimates of the cost to develop some code: scc . | grep '^Estimated' | sed 's/^/# /g' # Estimated Cost to Develop (organic) $51,771 # Estimated Schedule Effort (organic) 4.46 months # Estimated People Required (organic) 1.03 Misinterpreting the estimated cost of what I wrote as its estimated worth always makes me feel better about myself! Reflections Current uses & the path forward brzozowski_ts is currently useful if: * you want to validate that a string constant is part of a large set of strings that is impractical to enumerate as a string union. * you can write a RegEx that correctly describes your impractically-large-string-set. * you want to provide fast feedback to your users about the validity of string constants. * you also use nominal types to validate dynamic strings. * you want to find out the number of captures and/or names of capture groups a RegEx would produce Possible use-cases include: * infrastructure-as-code functions that accept IP addresses * strings that have a maximum length (e.g. resource names) * paths and routing paths (note: brzozowski_ts supports group extraction!) * urls, email addresses * hex/base64 strings The main use, however, is seeing how well compile-time regex works in practice. As I learn more, this library might go through a few more iterations. However, I don't intend to bring up to production quality. I'm hoping that experiments using brzozowski_ts will eventually contribute to discussion about compile-time RegEx validation in TypeScript. You probably shouldn't use my code It's overkill for most scenarios If you're checking if a string is part of a set of: * a finite number of strings (up to a few hundred): use a big string union. * an infinite set of strings with a small number of prefixes or suffixes: use template literal types. * an infinite set of strings with a simple pattern like the hex strings above: use this technique on your own. The simpler code above is faster than pulling in 800+ lines of type inference. Compile-time RegEx is inappropriate for many advanced use-cases If you're checking if a string is part of a set of: * a large set of strings with a nesting pattern (like HTML): you're out of luck, regular expressions can't match nested structures. * a large set of probably long, potentially infinite-length strings with a tricky-yet-regular pattern (e.g. CSVs): parse the input at runtime. Parse, don't Validate has a good overview of the benefits of parsing inputs at the boundaries of your program! It will slow down your compile times Though I'm not sure how much - benchmarking is a TODO. I developed on an 8-core, 32-GB linux x86_64 machine. I experienced responsive incremental compilation in VSCode, and running the test suite took ~2.4s. It's alpha quality The RegExp parsing and evaluation hasn't yet been extensively fuzzed. There are almost certainly bugs in the implementation. It also does naive string matching with backtracking, so it's certainly vulnerable to RegEx Denial of Service (ReDoS), albeit ReDoS of your compile times and not your server. It's a alpha release I might break the public APIs without warning. It's not a release at all, it's just a repo I'm not releasing the code to NPM until fuzzing makes me confident the code is correct. Tricks for programming in TypeScript Generics Read through the tricks in type-fest There are some very cool tricks in there! That codebase taught me: * arithmetic using tuple lengths * how to check if a type is equal to never despite the fact that all types are assignable to never Never say never: use extensible error types An error type stuffed with context makes debugging much easier than using never to handle unexpected branches of inference. I started out using never to handle cases I expected never to encounter, like type MyType = OtherType extends ThingItMustBe ? DoMoreStuff : never; After debugging why a deeply nested type was unexpectedly inferring never for the nth time, I started using type MyType = OtherType extends ThingItMustBe ? DoMoreStuff : { error: "this should never happen and here's why" } & { T: T; other: OtherType; }; Err also side-steps the problem that never is assignable to string: const oops: never extends string ? true : false = true; [playground link] Write test-cases as you develop You can put the test cases next to your type-under-construction using block scopes to avoid polluting your module's namespace: type MyType = T extends "example" ? "expected" : never; { const _: MyType<"example"> = "expected"; } With fast on-hover/inline type hints from your TypeScript language server, you can obtain a REPL-like development experience. Also, you can use the //@ts-expect-error directive to mark test-cases that should error out. prettier --experimental-ternaries handles deeply-nested types gracefully If you're doing something advanced with types, I'd highly recommend it. A nerd snipe for someone else Someone out there has written a compiler that targets TypeScript types, right? (c) Steven Kalt; This work is licensed as CC-BY-NC-SA 4.0 Last edited on 2024-10-16 in commit 465f0b8. Edit this Page