https://justine.lol/lex/

Oct 31st, 2024 @ justine's web page

Weird Lexical Syntax

[picture of artistically drawn ringtailed lemur sitting in front of red rectangle saying 'eldritch horrors' in white times new roman text]

I just learned 42 programming languages this month to build a new syntax highlighter for llamafile. I feel like I'm up to my eyeballs in programming languages right now. Now that it's Halloween, I thought I'd share some of the spookiest, most surprising syntax I've seen.

The languages I decided to support are Ada, Assembly, BASIC, C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML, Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make, Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust, Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig. That crosses off pretty much everything on the TIOBE Index except Scratch, which can't be highlighted, since it uses blocks instead of text.

How To Code a Syntax Highlighter

It's really not difficult to implement a syntax highlighter. You could probably write one over the course of a job interview. My favorite tools for doing this have been C++ and GNU gperf. The hardest problem here is avoiding the need to do a bunch of string comparisons to determine if something is a keyword or not. Most developers would just use a hash table, but gperf lets you create a perfect hash table. For example:

    %{
    #include <string.h>
    %}
    %pic
    %compare-strncmp
    %language=ANSI-C
    %readonly-tables
    %define lookup-function-name is_keyword_java_constant
    %%
    true
    false
    null

gperf was originally invented for gcc and it's a great way to squeeze out every last drop of performance. If you run the gperf command on the code above, it'll generate a .c file. You'll notice its hash function only needs to consider a single character in a string to get a collision free lookup. That's what makes it perfect, and perfect means better performance.
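To make the idea of a perfect hash concrete, here is a minimal Python sketch. It is not gperf's generated C, only an illustration of the same shape of lookup, and the names `KEYWORDS`, `TABLE`, and `is_keyword_java_constant_sketch` are hypothetical. It assumes the three keywords from the gperf input above. Since `true`, `false`, and `null` each start with a different letter, hashing on the first character alone is already collision free, so a lookup costs one index computation plus one string comparison.

```python
# Illustrative only: mimics the shape of a perfect-hash lookup,
# not the code that gperf actually emits.
KEYWORDS = ("true", "false", "null")

# The "hash" looks at a single character: the first letter of the word.
MIN_CODE = min(ord(k[0]) for k in KEYWORDS)
MAX_CODE = max(ord(k[0]) for k in KEYWORDS)

# One slot per possible hash value; most slots stay empty.
TABLE = [None] * (MAX_CODE - MIN_CODE + 1)
for keyword in KEYWORDS:
    slot = ord(keyword[0]) - MIN_CODE
    # If this ever fired, the hash would not be perfect for this keyword set.
    assert TABLE[slot] is None, "collision"
    TABLE[slot] = keyword


def is_keyword_java_constant_sketch(word: str) -> bool:
    """Return True if word is a keyword, using one hash and one comparison."""
    if not word:
        return False
    slot = ord(word[0]) - MIN_CODE
    if slot < 0 or slot >= len(TABLE):
        return False
    return TABLE[slot] == word
```

The table is built once up front, and the assertion inside the loop is what checks that the chosen hash really is collision free for this keyword set. A real gperf table is denser and picks its character positions automatically.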
I'm not sure who wants to be able to syntax highlight C at 35 MB per second, but I am now able to do so, even though I've defined about 4,000 keywords for the language. Thanks to gperf, those keywords don't slow things down.

The rest just boils down to finite state machines. You don't really need flex, bison, or ragel to build a basic syntax highlighter. You simply need a for loop and a switch statement. At least for my use case, where I've really only been focusing on strings, comments, and keywords. If I wanted to highlight things like C function names, well, then I'd probably need to do actual parsing. But focusing on the essentials, we're only really doing lexing at most. See highlight_ada.cpp as an example.

Demo

All the research you're about to read on this page went into making one thing, which is llamafile's new syntax highlighter. This is probably the strongest advantage that llamafile has over ollama these days, since ollama doesn't do syntax highlighting at all. Here's a demo of it running on Windows 10, using the Meta LLaMA 3.2 3B Instruct model. Please note, these llamafiles will run on MacOS, Linux, FreeBSD, and NetBSD too.

[screencast of the Mozilla/Llama-3.2-3B-Instruct-llamafile LLM being used to generate code in various programming languages (FORTRAN, Rust, C++, Perl) for printing the first 100 prime numbers]

The new highlighter and chatbot interface have made llamafile so pleasant for me to use, combined with the fact that open weights models like gemma 27b it have gotten so good, that it's become increasingly rare that I'll feel tempted to use Claude these days.

Examples of Surprising Lexical Syntax

So let's talk about the kinds of lexical syntax that surprised me while writing this highlighter.

C

The C programming language, despite claiming to be simple, actually has some of the weirdest lexical elements of any language.
For starters, we have trigraphs, which were probably invented to help Europeans use C when using keyboards that didn't include #, [, \, ], ^, {, |, }, and ~. You can replace those characters with ??=, ??(, ??/, ??), ??', ??<, ??!, ??>, and ??-. Intuitive, right? That means, for example, the following is perfectly valid C code.

    int
    main(int argc, char* argv??(??))
    ??<
        printf("hello world\n");
    ??>

That is, at least until trigraphs were removed in the C23 standard. However, compilers will be supporting this syntax forever for legacy software, so a good syntax highlighter ought to too.

But just because trigraphs are officially dead doesn't mean the standards committees haven't thought up other weird syntax to replace them. Consider universal characters:

    int \uFEB2 = 1;

This feature is useful for anyone who wants, for example, variable names with Arabic characters while still keeping the source code pure ASCII. I'm not sure why anyone would use it. I was hoping I could abuse this to say:

    int
    main(int argc, char* argv\u005b\u005d)
    \u007b
        printf("hello world\n");
    \u007d

But alas, GCC raises an error if universal characters aren't used on the specific UNICODE planes that've been blessed by the standards committee.

This next one is one of my favorites. Did you know that a single line comment in C can span multiple lines if you use backslash at the end of the line?

    //hi\
    there

Most other languages don't support this. Even languages that allow backslash escapes in their source code (e.g. Perl, Ruby, and Shell) don't have this particular feature from C. The ones that do support this too, as far as I can tell, are Tcl and GNU Make. Tools for syntax highlighting, like Emacs and Pygments, oftentimes get this wrong, although Vim seems to always be right about backslash.

Haskell

Every C programmer knows you can't embed a multi-line comment in a multi-line comment. For example:

    /*
     hello
     /* again */
     nope nope nope
    */

However with Haskell, you can. They finally fixed the bug.
Although they did adopt a different syntax.

    -- Test nested comments within code blocks
    let result3 = {- This comment contains {- a nested comment -} -} 10 - 5

Tcl

The thing that surprised me most about Tcl is that identifiers can have quotes in them. For example, this program will print a"b:

    puts a"b

You can even have quotes in your variable names, however you'll only be able to reference such a variable if you use the ${a"b} notation, rather than $a"b.

    set a"b doge
    puts ${a"b}

JavaScript

JavaScript has a builtin lexical syntax for regular expressions. However it's easy to lex it wrong if you aren't paying attention. Consider the following:

    var foo = /[/]/g;

When I first wrote my lexer, I would simply scan for the closing slash, and assume that any slashes inside the regex would be escaped. That turned out to be wrong when I highlighted some minified code. If a slash is inside the square brackets for a character set, then that slash doesn't need to be escaped!

Now onto the even weirder. There are some invisible UNICODE characters called the LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029). I don't know what the use case is for these codepoints, but the ECMAScript standard defines them as line terminators, which effectively makes them the same thing as \n. Since these are Trojan Source characters, I configure my Emacs to render them as | and P. However most software hasn't been written to be aware of these characters, and will oftentimes render them as question marks. Also as far as I know, no other language does this. I was able to use that to my advantage for SectorLISP, since it let me create C + JavaScript polyglots.

    //P`
    ... C only code goes here ...
    //`

That's how I'd insert C code into JavaScript files.

    //P`
    #if 0
    //`
    ... JavaScript only code goes here ...
    //P`
    #endif
    //`

And that's how I'd insert JavaScript into my C source code.
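Here is a minimal Python sketch of why the polyglot trick works, assuming a lexer that only needs to find where a `//` comment ends. The function name `end_of_line_comment` and the two terminator sets are hypothetical names chosen for illustration, and the C set is a simplification that ignores backslash line continuation. The only difference between the two lexers is which characters count as ending the line.

```python
# ECMAScript line terminators: LF, CR, LINE SEPARATOR, PARAGRAPH SEPARATOR.
JS_LINE_TERMINATORS = frozenset("\n\r\u2028\u2029")
# Simplified C view: only a newline ends a // comment
# (backslash line continuation is deliberately ignored in this sketch).
C_LINE_TERMINATORS = frozenset("\n")


def end_of_line_comment(src: str, start: int, terminators: frozenset) -> int:
    """Return the index of the character that ends the // comment at src[start]."""
    if not src.startswith("//", start):
        raise ValueError("no // comment at this position")
    i = start + 2
    while i < len(src) and src[i] not in terminators:
        i += 1
    return i


# A tiny polyglot: "//", then U+2029, then a backtick, then C code.
polyglot = "//\u2029`\nint x;\n//`\n"

js_end = end_of_line_comment(polyglot, 0, JS_LINE_TERMINATORS)
c_end = end_of_line_comment(polyglot, 0, C_LINE_TERMINATORS)
```

For the JavaScript lexer the comment stops at the PARAGRAPH SEPARATOR, so the backtick that follows opens a template literal that swallows the C code. For the C lexer the comment runs to the newline, so the backtick is just comment text and the next line is ordinary C.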
An example of a piece of production code where I did this is lisp.js, which is what powers my SectorLISP blog post. It both runs in the browser, and you can compile it with GCC and run it locally too. llamafile is able to correctly syntax highlight this stuff, but I've yet to find another syntax highlighter that does too. Not that it matters, since I doubt an LLM would ever print this. But it sure is fun to think about these corner cases.

Shell

We're all familiar with the heredoc syntax of shell scripts, e.g.

    cat <

    Public Function SomeFunction() As String
    #Else
    Public Function SomeFunction() As String
    #End If

Perl

One of the trickier languages to highlight is Perl. It exists in the spiritual gulf between shells and programming languages, and inherits the complexity of both. Perl isn't as popular today as it once was, but its influence continues to be prolific. Perl made regular expressions a first class citizen of the language, and the way regex works in Perl has since been adopted by many other programming languages, such as Python. However the regex lexical syntax itself continues to be somewhat unique. For example, in Perl, you can replace text similar to sed as follows:

    my $string = "HELLO, World!";
    $string =~ s/hello/Perl/i;
    print $string;  # Output: Perl, World!

Like sed, Perl also allows you to replace the slashes with an arbitrary punctuation character, since that makes it easier for you to put slashes inside your regex.

    $string =~ s!hello!Perl!i;

What you might not have known is that it's possible to do this with mirrored characters as well, in which case you need to insert an additional character:

    $string =~ s{hello}{Perl}i;

However s/// isn't the only weird thing that needs to be highlighted like a string. Perl has a wide variety of other magic prefixes.

    /case sensitive match/
    /case insensitive match/i
    y/abc/xyz/e
    s!hi!there!
    m!hi!i
    m;hi;i
    qr!hi!u
    qw!hi!h
    qq!hi!h
    qx!hi!h
    m-hi-
    s-hi-there-g
    s"hi"there"g
    s@hi@there@ yo
    s{hi}{there}g

One thing that makes this tricky to highlight is that you need to take context into consideration, so you don't accidentally think that y/x/ y/ is a division formula. Thankfully, Perl makes this relatively easy, because variables can always be counted upon to have sigils, which are usually $ for scalars, @ for arrays, and % for hashes.

    my $greeting = "Hello, world!";
    # Array: A list of names
    my @names = ("Alice", "Bob", "Charlie");
    # Hash: A dictionary of ages
    my %ages = ("Alice" => 30, "Bob" => 25, "Charlie" => 35);
    # Print the greeting
    print "$greeting\n";
    # Print each name from the array
    foreach my $name (@names) {
        print "$name\n";
    }

This helps us avoid the need for parsing the language grammar.

Perl also has this goofy convention for writing man pages in your source code. Basically, any =word at the start of the line will get it going, and =cut will finish it.

    #!/usr/bin/perl

    =pod

    =head1 NAME

    my_silly_script - A Perl script demonstrating =cut syntax

    =head1 SYNOPSIS

     my_silly_script [OPTIONS]

    =head1 DESCRIPTION

    This script does absolutely nothing useful, but it showcases
    the quirky =cut syntax for POD documentation in Perl.

    =head1 OPTIONS

    There are no options.

    =head1 AUTHOR

    Your Name

    =head1 COPYRIGHT

    Copyright (c) 2023 Your Name. All rights reserved.

    =cut

    print "Hello, world!\n";

Ruby

Of all the languages, I've saved the best for last, which is Ruby. Now here's a language whose syntax evades all attempts at understanding. Ruby is the union of all earlier languages, and it's not even formally documented. Their manual has a section on Ruby syntax, but it's very light on details. Whenever I try to test my syntax highlighting by concatenating all the .rb files on my hard drive, there's always another file that finds some way to break it.

    def `(command)
      return "just testing a backquote override"
    end

Since Ruby supports backquote syntax like var = `echo hello`, I'm not exactly sure how to tell that the backquote above isn't meant to be highlighted as a string. Another example is this:

    when /\.*\.h/
      options[:includes] <