[HN Gopher] Weird Lexical Syntax
       ___________________________________________________________________
        
       Weird Lexical Syntax
        
       Author : jart
       Score  : 304 points
       Date   : 2024-11-02 07:45 UTC (15 hours ago)
        
 (HTM) web link (justine.lol)
 (TXT) w3m dump (justine.lol)
        
       | llm_trw wrote:
       | I've done a fair bit of forth and I've not seen c" used. The
       | usual string printing operator is ." .
        
         | mananaysiempre wrote:
         | Counted ("Pascal") strings are rare nowadays so C" is not often
         | used. Its addr len equivalent is S" and that one is fairly
         | common in string manipulation code.
        
         | kragen wrote:
         | Right, _c"_ is for when you want to pass a literal string to
         | some other word, not print it. But I agree that it's not very
         | common, because you normally use _s"_ for that, which leaves
         | the address and length on the stack, while _c"_ leaves just an
         | address on the stack, pointing to a one-byte count field
         | followed by the bytes. I think adding _c"_ in Forth-83 (and
         | renaming _"_ to _s"_) was a mistake, and it would have been
         | better to deprecate the standard words that expect or produce
         | such counted strings, other than _count_ itself. See
         | https://forth-standard.org/standard/alpha,
         | https://forth-standard.org/standard/core/Cq,
         | https://forth-standard.org/standard/core/COUNT, and
         | https://forth-standard.org/standard/core/Sq.
         | 
         | You can easily add new string and comment syntaxes to Forth,
         | though. For example, you can add BCPL-style // comments to end
         | of line with this line of code in, I believe, all standard
         | Forths, though I've only tested it in GForth:
         | : // 10 word drop ; immediate
         | 
         | Getting it to work in block files requires more work but is
         | still only a few lines of code. The standard word _\_ does
         | this, and _see \_ decompiles the GForth implementation as
         |     : \
         |       blk @
         |       IF     >in @ c/l / 1+ c/l * >in !
         |              EXIT
         |       THEN
         |       source >in ! drop ; immediate
         | 
         | This kind of thing was commonly done for text editor commands,
         | for example; you might define _i_ as a word that reads text
         | until the end of the line and inserts it at the current
         | position in the editor, rather than discarding it like my //
         | above. Among other things, the screen editor in F83 does
         | exactly that.
         | 
         | So, as with Perl, PostScript, TeX, m4, and Lisps that support
         | readmacros, you can't lex Forth without executing it.
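
The point about not being able to lex Forth without executing it can be sketched with a toy in Python. This is not a Forth implementation: `run`, `skip_to_eol`, and `dot_quote` are hypothetical names, and the only assumption carried over from the comment above is that a word may consume input text itself when it executes.

```python
def run(source, words):
    """Split `source` into whitespace-delimited words, executing known ones.

    `words` maps a name to a function that may move the read position,
    which is the toy analogue of a Forth parsing word.
    """
    state = {"src": source, "pos": 0, "out": []}

    def next_word():
        src, pos = state["src"], state["pos"]
        while pos < len(src) and src[pos].isspace():
            pos += 1
        start = pos
        while pos < len(src) and not src[pos].isspace():
            pos += 1
        state["pos"] = pos
        return src[start:pos]

    while True:
        word = next_word()
        if not word:
            break
        if word in words:
            words[word](state)          # the word decides how much input to eat
        else:
            state["out"].append(word)   # unknown words are passed through as-is
    return state["out"]


def skip_to_eol(state):
    """Toy analogue of the `//` definition: discard the rest of the line."""
    newline = state["src"].find("\n", state["pos"])
    state["pos"] = len(state["src"]) if newline == -1 else newline + 1


def dot_quote(state):
    """Toy analogue of `."`: skip one delimiter space, read up to the next quote."""
    src = state["src"]
    start = state["pos"] + 1
    end = src.find('"', start)
    if end == -1:
        end = len(src)
    state["out"].append(("print", src[start:end]))
    state["pos"] = end + 1


SOURCE = '1 2 + // not code ." here\n." hello world" 3'
WITH_COMMENTS = run(SOURCE, {"//": skip_to_eol, '."': dot_quote})
WITHOUT_COMMENTS = run(SOURCE, {'."': dot_quote})
```

The same text splits into different tokens depending on whether `//` has been defined by the time it is reached, so a highlighter that does not track definitions can only guess.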
        
       | skrebbel wrote:
       | This was a delightful read, thanks!
        
       | croisillon wrote:
       | Glad to see confirmed that PHP is the most non weird programming
       | language ;)
        
         | rererereferred wrote:
         | I recently learned that PHP's heredoc closing marker can be
         | indented, and that indentation is then removed from every line
         | of the string:
         | 
         |     $a = <<<EOL
         |         This is
         |         not indented
         |             but this has 4 spaces of indentation
         |         EOL;
         | 
         | But the indentation has to match: if any line has fewer
         | leading spaces than the closing EOL, it gives an error.
        
         | alganet wrote:
         | There are two types of languages: the ones full of quirks and
         | the ones no one uses.
        
       | skitter wrote:
       | Another syntax oddity (not mentioned here) that breaks most
       | highlighters: In Java, unicode escapes can be anywhere, not just
       | in strings. For example, the following is a valid class:
       | class Foo\u007b}
       | 
       | and this assert will not trigger:
       | 
       |     assert
       |         // String literals can have unicode escapes like \u000A!
       |         "Hello World".equals("\u00E4");
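
Why the assert above passes is easier to see if the escapes are translated before anything else looks at the text. Below is a minimal Python sketch of such a pre-pass. The function name `translate_unicode_escapes` is hypothetical, and the handling is simplified: a backslash followed by another backslash is copied through as a pair, which is how this sketch approximates the rule that only a backslash preceded by an even number of backslashes can start an escape.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def translate_unicode_escapes(src):
    """Replace \\uXXXX escapes (one or more 'u's allowed) with their characters.

    Runs before any notion of comments or strings exists, so an escape
    inside a comment is translated like any other.
    """
    out = []
    i = 0
    while i < len(src):
        if src[i] != "\\":
            out.append(src[i])
            i += 1
            continue
        j = i + 1
        if j < len(src) and src[j] == "u":
            while j < len(src) and src[j] == "u":
                j += 1
            digits = src[j:j + 4]
            if len(digits) == 4 and all(c in HEX_DIGITS for c in digits):
                out.append(chr(int(digits, 16)))
                i = j + 4
                continue
            raise ValueError("illegal unicode escape")
        # Backslash not followed by 'u': copy it with the next character so
        # that an escaped backslash cannot begin a Unicode escape.
        out.append(src[i:i + 2])
        i += 2
    return "".join(out)


EXAMPLE = (
    "assert\n"
    "// String literals can have unicode escapes like \\u000A!\n"
    '"Hello World".equals("\\u00E4");'
)
TRANSLATED_LINES = translate_unicode_escapes(EXAMPLE).split("\n")
```

After translation the comment ends at the newline produced by `\u000A`, leaving a bare `!` on its own line, so the statement reads as `assert !"Hello World".equals("ä");`.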
        
         | ivanjermakov wrote:
         | I have never seen this in Java! Are there any use cases where
         | it could be useful?
        
           | susam wrote:
           | I don't know about usefulness but it does let us write
           | identifiers using Unicode characters. For example:
           | 
           |     public class Foo {
           |         public static void main(String[] args) {
           |             double \u03c0 = 3.14159265;
           |             System.out.println("\u03c0 = " + \u03c0);
           |         }
           |     }
           | 
           | Output:
           | 
           |     $ javac Foo.java && java Foo
           |     π = 3.14159265
           | 
           | Of course, nowadays we can simply write this with any decent
           | editor:
           | 
           |     public class Foo {
           |         public static void main(String[] args) {
           |             double π = 3.14159265;
           |             System.out.println("π = " + π);
           |         }
           |     }
           | 
           | Support for Unicode escape sequences is a result of how the
           | Java Language Specification (JLS) defines InputCharacter.
           | Quoting from Section 3.4 of JLS
           | <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:
           | 
           |     InputCharacter:
           |         UnicodeInputCharacter but not CR or LF
           | 
           | UnicodeInputCharacter is defined as the following in section
           | 3.3:
           | 
           |     UnicodeInputCharacter:
           |         UnicodeEscape
           |         RawInputCharacter
           | 
           |     UnicodeEscape:
           |         \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
           | 
           |     UnicodeMarker:
           |         u {u}
           | 
           |     HexDigit:
           |         (one of)
           |         0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
           | 
           |     RawInputCharacter:
           |         any Unicode character
           | 
           | As a result the lexical analyser honours Unicode escape
           | sequences absolutely anywhere in the program text. For
           | example, this is a valid Java program:
           | 
           |     public class Bar {
           |         public static void \u006d\u0061\u0069\u006e(String[] args) {
           |             System.out.println("hello, world");
           |         }
           |     }
           | 
           | Here is the output:
           | 
           |     $ javac Bar.java && java Bar
           |     hello, world
           | 
           | However, this is an incorrect Java program:
           | 
           |     public class Baz {
           |         // This comment contains \u6d.
           |         public static void main(String[] args) {
           |             System.out.println("hello, world");
           |         }
           |     }
           | 
           | Here is the error:
           | 
           |     $ javac Baz.java
           |     Baz.java:2: error: illegal unicode escape
           |         // This comment contains \u6d.
           |                                     ^
           |     1 error
           | 
           | Yes, this is an error even if the illegal Unicode escape
           | sequence occurs in a comment!
        
             | ivanjermakov wrote:
             | I wonder if the full Unicode range was accepted because some
             | companies write code in languages other than English.
        
           | layer8 wrote:
           | Javac uses the platform encoding [0] by default to interpret
           | Java source files. This means that Java source code files are
           | inherently non-portable. When Java was first developed (and
           | for a long time after), this was the default situation for
           | any kind of plain text file. The escape sequence syntax
           | allows transforming [1] Java source code into a portable
           | (that is, ASCII-only) representation that is completely
           | equivalent to the original, and also to convert it back to
           | any platform encoding.
           | 
           | Source control clients could apply this automatically upon
           | checkin/checkout, so that clients with different platform
           | encodings can work together. Alternatively, IDEs could do
           | this when saving/loading Java source files. That never quite
           | caught on, and the general advice was to stick to ASCII, at
           | least outside comments.
           | 
           | [0] Since JDK 18, the default encoding is UTF-8. This
           | probably also extends to _javac_, though I haven't verified
           | it.
           | 
           | [1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...
        
         | mistercow wrote:
         | I also argue that failing to syntax highlight this correctly is
         | a security issue. You can terminate block comments with Unicode
         | escapes, so if you wanted to hide some malicious code in a Java
         | source file, you just need an excuse for there to be a block of
         | Unicode escapes in a comment. A dev who doesn't know about this
         | quirk is likely to just skip over it, assuming it's commented
         | out.
        
       | mcphage wrote:
       | At one point there was an open source project to formally specify
       | Ruby, but I don't know if it's still alive:
       | https://github.com/ruby/spec
       | 
       | Hmm, it seems to be alive, but based more on behavior than
       | syntax.
        
       | keybored wrote:
       | Meanwhile NeoVim doesn't syntax highlight my commit message
       | properly if I have messed with "commit cleanup" enough.
       | 
       | The comment character in Git commit messages can be a problem
       | when you insist on prepending your commits with some "id" and the
       | id starts with `#`. One suggestion was to allow backslash escapes
       | in commit messages since that makes sense to a computer
       | scientist.[1]
       | 
       | But looking at all of this lexical stuff I wonder if makes-sense-
       | to-computer-scientist is a good goal. They invented the problem
       | of using a uniform delimiter for strings and then had to solve
       | their own problem. Maybe it was hard to use backtick in the 70's
       | and 80's, but today[2] you could use backtick to start a string
       | and a single quote to end it.
       | 
       | What do C-like programming languages use single quotes for? To
       | quote characters. Why do you need to quote characters? I've never
       | seen a literal character which needed an "end character" marker.
       | 
       | Raw strings would still be useful but you wouldn't need raw
       | strings just to do a very basic thing like make a string which
       | has typewriter quotes in it.
       | 
       | Of course this was for C-like languages. Don't even get me
       | started on shell and related languages where basically everything
       | is a string and you have to make a single-quote/double-quote
       | battle plan before doing anything slightly nested.
       | 
       | [1] https://lore.kernel.org/git/vpq3808p40o.fsf@anie.imag.fr/
       | 
       | [2] Notwithstanding us Europeans that use a dead-key keyboard
       | layout where you have to type twice to get one measly backtick
       | (not that I use those)
        
         | pwdisswordfishz wrote:
         | > The comment character in Git commit messages can be a problem
         | when you insist on prepending your commits with some "id" and
         | the id starts with `#`
         | 
         | https://git-scm.com/docs/git-commit#Documentation/git-commit...
        
           | keybored wrote:
           | See "commit cleanup".
           | 
           | There are surprising layers to this. The reporter in that
           | thread says that git-commit will "happily" accept `#` in
           | commit messages, which is half-true: it will accept it if you
           | don't edit the message, since the `default` cleanup (that you
           | linked to) will not remove comments if the message is given
           | through things like `-m` and not an editing session. So
           | `git commit -m'#something'` is fine. But then try to do a
           | rebase or cherry-pick or whatever else later, and maybe get a
           | merge commit message with commented-out "conflicted" files.
           | Well, it can get confusing.
        
         | kragen wrote:
         | > _Maybe it was hard to use backtick in the 70's and 80's, but
         | today[2] you could use backtick to start a string and a single
         | quote to end it._
         | 
         | That's how quoting works by default in m4 and TeX, both defined
         | in the 70s. Unfortunately Unicode retconned the ASCII
         | apostrophe character ' to be a vertical line, maybe out of a
         | misguided deference to Microsoft Windows, and now we all have
         | to suffer the consequences. (Unless we're using Computer Modern
         | fonts or other fonts that predate this error, such as VGA font
         | ROM dumps.)
         | 
         | In the 70s and 80s, and into the current millennium on Unix,
         | `x' did look like 'x', but now instead it looks like dogshit.
         | Even if you are willing to require a custom font for
         | readability, though, that doesn't solve the problem; you need
         | some way to include an apostrophe in your quoted string!
         | 
         | As for end delimiters, C itself supports multicharacter
         | literals, which are potentially useful for things like
         | Macintosh type and creator codes, or FTP commands.
         | Unfortunately, following the Unicode botch theme, the standard
         | failed to define an endianness or minimum width for them, so
         | they're not very useful today. You can use them as enum values
         | if you want to make your memory dumps easier to read in the
         | debugger, and that's about it. I think Microsoft's compiler
         | botched them so badly that even that's not an option if you
         | need your code to run on it.
        
           | ygra wrote:
           | > Unfortunately Unicode retconned the ASCII apostrophe
           | character ' to be a vertical line
           | 
           | Unicode does not prescribe the appearance of characters.
           | Although in the code chart [1] it says >>neutral (vertical)
           | glyph with mixed usage<< (next to >>apostrophe-quote<< and
           | >>single quote<<), font vendors have to deal with this mixed
           | usage. And with Unicode the correct quotation marks have
           | their own code points, making it unnecessary to design fonts
           | where the ASCII apostrophe takes their form, but rendering
           | all other uses pretty ugly.
           | 
           | I would regard using ` and ' as paired quotation marks as a
           | hack from times when typographic expression was simply not
           | possible with the character sets of the day.
           | 
           | _________
           | 
           | [1]
           |     0027 ' APOSTROPHE
           |         = apostrophe-quote (1.0)
           |         = single quote
           |         = APL quote
           |         * neutral (vertical) glyph with mixed usage
           |         * 2019 ’ is preferred for apostrophe
           |         * preferred characters in English for paired quotation
           |           marks are 2018 ‘ & 2019 ’
           |         * 05F3 ׳ is preferred for geresh when writing Hebrew
           |         - 02B9 ʹ modifier letter prime
           |         - 02BC ʼ modifier letter apostrophe
           |         - 02C8 ˈ modifier letter vertical line
           |         - 0301 ◌́ combining acute accent
           |         - 030D ◌̍ combining vertical line above
           |         - 05F3 ׳ hebrew punctuation geresh
           |         - 2018 ‘ left single quotation mark
           |         - 2019 ’ right single quotation mark
           |         - 2032 ′ prime
           |         - A78C ꞌ latin small letter saltillo
        
           | keybored wrote:
           | > That's how quoting works by default in m4 and TeX, both
           | defined in the 70s.
           | 
           | Good point. And it was in m4[1] I saw that
           | backtick+apostrophe syntax. I would have probably not thought
           | of that possibility if I hadn't seen it there.
           | 
           | [1] Probably on Wikipedia since I have never used it
           | 
           | > Unfortunately Unicode retconned the ASCII apostrophe
           | character ' to be a vertical line, maybe out of a misguided
           | deference to Microsoft Windows, and now we all have to suffer
           | the consequences. (Unless we're using Computer Modern fonts
           | or other fonts that predate this error, such as VGA font ROM
           | dumps.)
           | 
           | I do think the vertical line looks subpar (and I don't use it
           | in prose). But most programmers don't seem bothered by it. :|
           | 
           | > In the 70s and 80s, and into the current millennium on
           | Unix, `x' did look like 'x', but now instead it looks like
           | dogshit.
           | 
           | Emacs tries to render it like 'x' since it uses
           | backtick+apostrophe for quotes. With some mixed results in my
           | experience.
           | 
           | > Even if you are willing to require a custom font for
           | readability, though, that doesn't solve the problem; you need
           | some way to include an apostrophe in your quoted string!
           | 
           | Aha, I honestly didn't even think that far. Seems a bit
           | restrictive to not be able to use possessives and
           | contractions in strings without escapes.
           | 
           | > As for end delimiters, C itself supports multicharacter
           | literals, which are potentially useful for things like
           | Macintosh type and creator codes, or FTP commands.
           | 
           | I should have made it clear that I was only considering
           | C-likes and not C itself. A language from the C trigraph days
           | can be excused. To a certain extent.
        
             | kragen wrote:
             | I'd forgotten about `' in Emacs documentation! That may be
             | influenced by TeX.
             | 
             | C multicharacter literals are unrelated to trigraphs.
             | Trigraphs were a mistake added many years later in the ANSI
             | process.
        
           | tom_ wrote:
           | See also: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
        
             | kragen wrote:
             | This is an excellent document. I disagree with its
             | normative conclusions, because I think being incompatible
             | with ASCII, Unix, Emacs, and TeX is worse than being
             | incompatible with ISO-8859-1, Microsoft Windows, and MacOS
             | 9, but it is an excellent reference for the factual
             | background.
        
         | shawa_a_a wrote:
         | The comment character is also configurable:
         | git config core.commentchar <char>
         | 
         | This is helpful where you want to use, say, Markdown so that
         | tidily formatted commit messages can make up your pull request
         | body too.
        
           | keybored wrote:
           | I want to try to set it to `auto` and see what spicy things
           | it comes up with.
        
       | yen223 wrote:
       | select'select'select
       | 
       | is a perfectly valid SQL query, at least for Postgres.
       | 
       | Languages' approach to whitespace between tokens is all over the
       | place
        
       | notsylver wrote:
       | As soon as I saw this was part of llamafile I was hoping that it
       | would be used to limit LLM output to always be "valid" code as
       | soon as it saw the backticks, but I suppose most LLMs don't have
       | problems with that anyway. And I'm not sure you'd want something
       | like that automatically forcing valid code anyway
        
         | dilap wrote:
         | llama.cpp does support something like this -- you can give it a
         | grammar which restricts the set of available next tokens that
         | are sampled over
         | 
         | so in theory you could notice "```python" or whatever and then
         | start restricting to valid python code. (at least in theory,
         | not sure how feasible/possible it would be in practice w/ their
         | grammar format.)
         | 
         | for code i'm not sure how useful it would be since likely any
         | model that is giving you working code wouldn't be struggling w/
         | syntax errors anyway?
         | 
         | but i have had success experimentally using the feature to
         | drive fiction content for a game from a smaller llm to be in a
         | very specific format.
        
           | notsylver wrote:
           | yeah, ive used llama.cpp grammars before, which is why i was
           | thinking about it. i just think it'd be cool for llamafile to
           | do basically that, but with included defaults so you could
           | eg, require JSON output. it could be cool for prototyping or
           | something. but i dont think that would be too useful anyway,
           | most of the time i think you would want to restrict it to a
           | specific schema, so i can only see it being useful for
           | something like a tiny local LLM for code completion, but that
           | would just encourage valid-looking but incorrect code.
           | 
           | i think i just like the idea of restricting LLM output, it
           | has a lot of interesting use cases
        
             | dilap wrote:
             | gotchya. i do think that is a cool idea actually -- LLMs
             | tiny enough to do useful things with formally structured
             | output but not big enough to nail the structure ~100% is
             | probably not an empty set.
        
       | pwdisswordfishz wrote:
       | > Of all the languages, I've saved the best for last, which is
       | Ruby. Now here's a language whose syntax evades all attempts at
       | understanding.
       | 
       | TeX with its arbitrarily reprogrammable lexer: how adorable
        
         | fanf2 wrote:
         | Lisp reader macros allow you to program its lexer too.
        
           | skydhash wrote:
           | You can basically define a new language with a few lines of
           | code in Racket.
        
       | pansa2 wrote:
       | > _TypeScript, Swift, Kotlin, and Scala take string interpolation
       | to the furthest extreme of encouraging actual code being embedded
       | inside strings. So to highlight a string, one must count curly
       | brackets and maintain a stack of parser states._
       | 
       | Presumably this is also true in Python - IIRC the brace-delimited
       | fields within f-strings may contain arbitrary expressions.
       | 
       | More generally, this must mean that the lexical grammar of those
       | languages isn't regular. "Maintaining a stack" isn't part of a
       | finite-state machine for a regular grammar - instead we're in the
       | realm of pushdown automata and context-free grammars.
       | 
       | Is it even possible to support generalized string interpolation
       | within a strictly regular lexical grammar?
        
         | aphantastic wrote:
         | > Is it even possible to support generalized string
         | interpolation within a strictly regular lexical grammar?
         | 
         | Almost certainly not, a fun exercise is to attempt to devise a
         | Pumping tactic for your proposed language. If it doesn't exist,
         | it's not regular.
         | 
         | https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...
        
         | fanf2 wrote:
         | Complicated interpolation can be lexed as a regular language if
         | you treat strings as three separate lexical things, eg in
         | JavaScript template literals there are,
         | `stuff${         }stuff${         }stuff`
         | 
         | so the ${ and } are extra closing and opening string
         | delimiters, leaving the nesting to be handled by the parser.
         | 
         | You need a lexer hack so that the lexer does not treat } as the
         | start of a string literal, except when the parser is inside an
         | interpolation but all nested {} have been closed.
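
A rough Python sketch of the split described above, where a template literal becomes several string-like tokens with ordinary code in between. It is an illustration under stated assumptions, not how any real JavaScript engine does it: `lex_template` is a hypothetical name, only a tiny JavaScript-like subset is handled, and the "lexer hack" is folded into the lexer itself as a stack of brace counters instead of feedback from a parser.

```python
def lex_template(src):
    """Return (kind, text) tokens; kind is "str" for template pieces, else "code"."""
    tokens = []
    stack = []          # one counter per open ${...}: unclosed "{" seen inside it
    i, n = 0, len(src)

    def read_string_part(i):
        # Scan literal text until the template ends (`) or an interpolation starts (${).
        start = i
        while i < n:
            if src[i] == "\\":
                i += 2
            elif src[i] == "`":
                return src[start:i], "`", i + 1
            elif src.startswith("${", i):
                return src[start:i], "${", i + 2
            else:
                i += 1
        raise SyntaxError("unterminated template literal")

    while i < n:
        ch = src[i]
        if ch == "`" or (ch == "}" and stack and stack[-1] == 0):
            if ch == "}":
                stack.pop()             # this "}" closes an interpolation
            text, terminator, i = read_string_part(i + 1)
            tokens.append(("str", ch + text + terminator))
            if terminator == "${":
                stack.append(0)         # a new interpolation is now open
        elif ch in "{}":
            if stack:
                stack[-1] += 1 if ch == "{" else -1
            tokens.append(("code", ch))
            i += 1
        elif ch.isspace():
            i += 1
        else:
            start = i
            while i < n and not src[i].isspace() and src[i] not in "`{}":
                i += 1
            tokens.append(("code", src[start:i]))
    if stack:
        raise SyntaxError("unterminated interpolation")
    return tokens


NESTED = lex_template("`a${ x + `b${y}c` }d`")
OBJECT = lex_template("`${ {a: 1} }`")
```

Each `"str"` token is bounded by a backtick, `${`, or `}` on either side, which matches the "three separate lexical things" view. The counter stack is the part a strictly regular lexer cannot have, which is why the comment above pushes it to the parser.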
        
       | irdc wrote:
       | I'd be interested to see a re-usable implementation of joe's[0]
       | syntax highlighting.[1] The format is powerful enough to allow
       | for the proper highlighting of Python f-strings.[2]
       | 
       | 0. https://joe-editor.sf.net/
       | 
       | 1. https://github.com/cmur2/joe-
       | syntax/blob/joe-4.4/misc/HowItW...
       | 
       | 2.
       | https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...
        
         | pama wrote:
         | Interestingly, python f-strings changed their syntax at version
         | 3.12, so highlighting should depend on the version.
        
           | irdc wrote:
           | It's just that nesting them arbitrarily is now allowed,
           | right? That shouldn't matter much for a mere syntax
           | highlighter then. And one could even argue that code that
           | relies on this too much is not really for human consumption.
        
             | pansa2 wrote:
             | Also, you can now use the same quote character that
             | encloses an f-string within the {} expressions. That could
             | make them harder to tokenize, because it makes it harder to
             | recognise the end of the string.
        
       | rererereferred wrote:
       | In the C# multiquoted strings, how does it know this:
       | 
       |     Console.WriteLine("""""");
       |     Console.WriteLine("""""");
       | 
       | Are 2 triplequoted empty strings and not one
       | "\nConsole.WriteLine(" sixtuplequoted string?
        
         | ygra wrote:
         | The former, I'd say.
         | 
         | https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
         | 
         | For a multi-line string the quotes have to be on their own
         | line.
        
         | Joker_vD wrote:
         | If the opening quotes are followed by anything that is not a
         | whitespace before the next new-line (or EOF), then it's a
         | single-line string.
         | 
         | I imagine implementing those things took several iterations :)
        
         | yen223 wrote:
         | It's a syntax error!
         | 
         |     Unterminated raw string literal.
         | 
         | https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...
        
           | Joker_vD wrote:
           | Ah, so there is no backtracking in lexer for this case. Makes
           | sense.
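
One way to read the result above, sketched in Python. This is a guess at the shape of the rule, not the C# compiler's algorithm: `lex_raw_string` is a hypothetical helper that assumes the opening delimiter is the longest run of quotes (at least three), that anything other than whitespace after it on the same line makes the literal single-line, and that the closing delimiter is the same number of quotes. Indentation stripping for multi-line literals is not modelled.

```python
def lex_raw_string(src, start):
    """Return the raw body of a C#-style raw string literal beginning at `start`."""
    count = 0
    while start + count < len(src) and src[start + count] == '"':
        count += 1
    if count < 3:
        raise ValueError("not a raw string literal")
    closer = '"' * count
    body_start = start + count

    line_end = src.find("\n", body_start)
    if line_end == -1:
        line_end = len(src)
    rest_of_line = src[body_start:line_end]

    if rest_of_line.strip():
        # Single-line form: the closing quotes must appear on this line.
        end = rest_of_line.find(closer)
        if end == -1:
            raise SyntaxError("Unterminated raw string literal.")
        return rest_of_line[:end]

    # Multi-line form: the body starts on the next line.
    end = src.find(closer, line_end)
    if end == -1:
        raise SyntaxError("Unterminated raw string literal.")
    return src[line_end + 1:end]


SIX_QUOTES = 'Console.WriteLine("""""");\nConsole.WriteLine("""""");'
```

Under these assumptions the six quotes are taken as one opening delimiter, the `);` after them makes the literal single-line, and no six-quote closer follows on that line, which would explain the error without any backtracking.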
        
       | ygra wrote:
       | As for C#'s triple-quoted strings, they actually appeared in
       | Java first, and C# ended up adopting the same or almost the same
       | semantics, including stripping leading whitespace.
        
       | pdw wrote:
       | Some random things that the author seems to have missed:
       | 
       | > but TypeScript, Swift, Kotlin, and Scala take string
       | interpolation to the furthest extreme of encouraging actual code
       | being embedded inside strings
       | 
       | Many more languages support that:
       | 
       |     C#          $"{x} plus {y} equals {x + y}"
       |     Python      f"{x} plus {y} equals {x + y}"
       |     JavaScript  `${x} plus ${y} equals ${x + y}`
       |     Ruby        "#{x} plus #{y} equals #{x + y}"
       |     Shell       "$x plus $y equals $(echo "$x+$y" | bc)"
       |     Make :)     echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
       | 
       | > Tcl
       | 
       | Tcl is funny because comments are only recognized in code, and
       | since it's homoiconic, it's very hard to distinguish code and
       | data. { } are just funny string delimiters. E.g.:
       | 
       |     xyzzy {#hello world}
       | 
       | Is xyzzy a command that takes a code block or a string? There's
       | no way to tell. (Yes, that means that the Tcl tokenizer/parser
       | cannot discard comments: only at evaluation time it's possible to
       | tell if something is a comment or not.)
       | 
       | > SQL
       | 
       | PostgreSQL has the very convenient dollar-quoted strings:
       | https://www.postgresql.org/docs/current/sql-syntax-lexical.h...
       | E.g. these are equivalent:
       | 
       |     'Dianne''s horse'
       |     $$Dianne's horse$$
       |     $SomeTag$Dianne's horse$SomeTag$
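
For a highlighter, the part of dollar quoting that matters is that the closing delimiter must repeat the opening tag. A minimal Python sketch with a backreference follows. `DOLLAR_QUOTED`, `read_dollar_quoted`, and `read_single_quoted` are hypothetical names, and the tag pattern is restricted to ASCII identifiers as a simplifying assumption.

```python
import re

# Group 1 is the (possibly empty) tag; \1 forces the closer to use the same tag.
DOLLAR_QUOTED = re.compile(
    r"\$((?:[A-Za-z_][A-Za-z_0-9]*)?)\$(.*?)\$\1\$",
    re.DOTALL,
)


def read_dollar_quoted(text, pos=0):
    """Return (body, end_position) for a dollar-quoted string at `pos`, or None."""
    match = DOLLAR_QUOTED.match(text, pos)
    if match is None:
        return None
    return match.group(2), match.end()


def read_single_quoted(literal):
    """Decode a standard SQL string literal, where '' stands for one quote."""
    if len(literal) < 2 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a single-quoted literal")
    return literal[1:-1].replace("''", "'")


EQUIVALENT = [
    read_single_quoted("'Dianne''s horse'"),
    read_dollar_quoted("$$Dianne's horse$$")[0],
    read_dollar_quoted("$SomeTag$Dianne's horse$SomeTag$")[0],
]
```

Because the closer has to carry the same tag, a `$$...$$` string can sit inside a `$fn$...$fn$` body without ending it, which is the usual reason for tagging.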
        
         | autarch wrote:
         | Perl lets you do this too:
         | 
         |     my $foo = 5;
         |     my $bar = 'x';
         |     my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
         |     print "$quux\n";
         | 
         | This prints out:
         | 
         |     I have 5 x's: xxxxx
         | 
         | The "@{[...]}" syntax is abusing Perl's ability to interpolate
         | an _array_ as well as a scalar. The inner "[...]" creates an
         | array reference and the outer "@{...}" dereferences it.
         | 
         | For reasons I don't remember, the Perl interpreter allows
         | arbitrary code in the inner "[...]" expression that creates the
         | array reference.
        
           | Izkata wrote:
           | > For reasons I don't remember, the Perl interpreter allows
           | arbitrary code in the inner "[...]" expression that creates
           | the array reference.
           | 
           | ...because it's an array value? Aside from how the languages
           | handle references, how is that part any different from, for
           | example, this in python:
           | 
           |     >>> [5 * 'x']
           |     ['xxxxx']
           | 
           | You can put (almost) anything there, as long as it's an
           | expression that evaluates to a value. The resulting value is
           | what goes into the array.
        
             | autarch wrote:
             | I understand that's constructing an array. What's a bit odd
             | is that the interpreter allows you to string interpolate
             | any expression when constructing the array reference inside
             | the string.
        
               | Izkata wrote:
               | It's not...? Well, not directly: It's string
               | interpolating an array of values, and the array is
               | constructed using values from the results of expressions.
               | These are separate features that compose nicely.
        
           | weinzierl wrote:
           | You also don't need quotes around strings (barewords). So
           | 
           |     my $bar = x;
           | 
           | should give the same result.
           | 
           | Good luck with lexing that properly.
           | 
           | https://perlmaven.com/barewords-in-perl
        
         | layer8 wrote:
         | > actual code being embedded inside strings
         | 
         | My view on this is that it shouldn't be interpreted as code
         | being embedded inside strings, but as a special form of string
         | concatenation syntax. In turn, this would mean that you can
         | nest the syntax, for example:
         | 
         |     "foo { toUpper("bar { x + y } bar") } foo"
         | 
         | The individual tokens being (one per line):
         | 
         |     "foo {
         |     toUpper
         |     (
         |     "bar {
         |     x
         |     +
         |     y
         |     } bar"
         |     )
         |     } foo"
         | 
         | If `+` does string concatenation, the above would effectively
         | be equivalent to:
         | 
         |     "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
         | 
         | I don't know if there is a language that actually works that
         | way.
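         | 
         | A minimal sketch of that tokenization, in Python (not from the
         | original comment). It assumes a toy language with no escape
         | sequences and no braces in ordinary code; the names lex and
         | CODE_TOKEN are made up for illustration:

```python
import re

# Hypothetical token pattern for the code between string pieces.
CODE_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[()+,]")


def lex(src):
    """Return the token list, treating `"text {` and `} text"` as tokens."""
    tokens = []
    depth = 0  # number of interpolation holes currently open
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == '"' or (ch == "}" and depth > 0):
            # A string piece starts at `"` or `}` and runs to the next `{` or `"`.
            if ch == "}":
                depth -= 1
            j = i + 1
            while j < len(src) and src[j] not in '{"':
                j += 1
            if j == len(src):
                raise SyntaxError("unterminated string piece")
            if src[j] == "{":
                depth += 1
            tokens.append(src[i:j + 1])
            i = j + 1
        else:
            m = CODE_TOKEN.match(src, i)
            if m is None:
                raise SyntaxError(f"unexpected character {ch!r} at {i}")
            tokens.append(m.group())
            i = m.end()
    return tokens


# Prints the same ten tokens listed above, one per line.
for token in lex('"foo { toUpper("bar { x + y } bar") } foo"'):
    print(token)
```

         | The only state this lexer carries is a count of open "{"
         | holes, which is what tells it whether a "}" resumes a string.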
        
           | panzi wrote:
           | Indeed in some of the listed languages you can nest it like
           | that, but in others (e.g. Python) you can't. I would guess
           | they deliberately don't want to enable that and it's not a
           | problem in their parser or something.
        
             | layer8 wrote:
             | Even when nesting is disallowed, my point is that I find it
             | preferable to not view it (and syntax-highlight it) as a
             | "special string" with embedded magic, but as multiple
             | string literals with just different delimiters that allow
             | omitting the explicit concatenation operator, and normal
             | expressions interspersed in between. I think it's important
             | to realize that it is really just very simple syntactic
             | sugar for normal string concatenation.
        
             | Tarean wrote:
             | As of Python 3.6 you can nest f-strings. Not all formatters
             | and highlighters have caught up, though.
             | 
             | Which is fun, because correct highlighting depends on
             | language version. Haskell has similar problems where
             | different compiler flags require different parsers. Close
             | enough is sufficient for syntax highlighting, though.
             | 
             | Python is also a bit weird because it calls the format
             | methods, so objects can intercept and react to the format
             | specifiers in the f-string while being formatted.
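             | 
             | A small illustration of that last point (the Money class
             | and its "usd" specifier are invented for this example):
             | whatever follows the ":" in a replacement field is handed
             | to the object's __format__ method.

```python
class Money:
    """Toy value type that interprets its own format specifier."""

    def __init__(self, cents):
        self.cents = cents

    def __format__(self, spec):
        # `spec` is the text after ':' in the f-string field ('' if absent).
        if spec == "usd":
            return f"${self.cents / 100:.2f}"
        if spec == "":
            return str(self.cents)
        raise ValueError(f"unknown format specifier {spec!r}")


price = Money(1999)
print(f"{price:usd}")  # $19.99
print(f"{price}")      # 1999
```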
        
               | panzi wrote:
               | I didn't mean nested f-strings. I mean this is a syntax
               | error:
               | 
               | >>> print(f"foo {"bar"}")
               | SyntaxError: f-string: expecting '}'
               | 
               | Only this works:
               | 
               | >>> print(f"foo {'bar'}")
               | foo bar
        
               | pdw wrote:
               | You're using an old Python version. On recent versions,
               | it's perfectly fine:
               | 
               | Python 3.12.7 (main, Oct  3 2024, 15:15:22) [GCC 14.2.0] on linux
               | Type "help", "copyright", "credits" or "license" for more information.
               | >>> print(f"foo {"bar"}")
               | foo bar
        
           | epcoa wrote:
           | > "foo { ...
           | 
           | That should probably not be one token.
           | 
           | > My view on this is that it shouldn't be interpreted as code
           | being embedded inside strings
           | 
           | I'm not sure exactly what you're proposing and how it is
           | different. You still can't parse it as a regular lexical
           | grammar.
           | 
           | How does this change how you highlight either?
           | 
           | Whatever you call it, to the lexer it is a special string, it
           | has to know how to match it, the delimiters are materially
           | different than concatenation.
           | 
           | I might be being dense but I'm not sure what's formally
           | distinct.
        
         | panzi wrote:
         | Is this a bash-ism?
         | 
         | "$x plus $y equals $((x+y))"
        
           | jonahx wrote:
           | This works in "sh" as well for me.
        
             | panzi wrote:
             | On some systems (like on mine) sh is just a link to bash,
             | so I couldn't test it.
        
           | jwilk wrote:
           | No, it's portable shell syntax.
        
           | LukeShu wrote:
           | "$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic
           | Expansion").
           | 
           | But if I'm not mistaken, it originated in csh.
        
           | susam wrote:
           | > Is this a bash-ism?
           | 
           | > "$x plus $y equals $((x+y))"
           | 
           | No, it is specified in POSIX:
           | https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...
        
         | therein wrote:
         | > PostgreSQL has the very convenient dollar-quoted strings
         | 
         | I did not know that. Today I learned.
        
         | sundarurfriend wrote:
         | > Many more languages support that:
         | 
         | Julia as well:
         | 
         | Julia    "$x plus $y equals $(x+y)"
        
         | thesz wrote:
         | VHDL
         | 
         | There is a record constructor syntax in VHDL using attribute
         | invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr).
         | This means that if the first field of your record is a subtype
         | of a character type, you can get a record construction
         | expression like this one: REC'('0',1,"10101").
         | 
         | Good luck distinguishing between '(' as a character literal and
         | the token sequence "'", "(" and "'0'" at the lexical level.
         | 
         | Haskell.
         | 
         | Haskell has context-free syntax for bracketed ("{-" ... "-}")
         | comments. The lexer has to keep bracketed comments balanced
         | (for every "{-" there should be a matching "-}" somewhere).
        
         | 1vuio0pswjnm7 wrote:
         | Shell "$x plus $y equals $((x+y))"
         | 
         | Shell "$x plus $y equals $(expr $x + $y)"
        
       | __MatrixMan__ wrote:
       | This was a fun read, but it left me a bit more sympathetic to the
       | lisp perspective, which (if I've understood it) is that syntax,
       | being not an especially important part of a language, is more of
       | a hurdle than a help, and should be as simple and uniform as
       | possible so we can focus on other things.
       | 
       | Which is sort of ironic because learning how to do structural
       | editing on lisps has absolutely been more hurdle than help so
       | far, but I'm sure it'll pay off eventually.
        
         | mqus wrote:
         | Having a simple syntax might be fine for computers but syntax
         | is mainly designed to be read and written by humans. Having a
         | simple one like lisp then just makes syntactic discussions a
         | semantic problem, just shifting the layers.
         | 
         | And I think a complex syntax is far easier to read and write
         | than a simple syntax with complex semantics. You also get a
         | faster feedback loop in case the syntax of your code is wrong
         | vs the semantics (which might be undiscovered until runtime).
        
           | drewr wrote:
           | I don't understand your distinction between syntax and
           | semantics. If the semantics are complex, wouldn't that mean
           | the syntax is thus complex?
        
             | SuperCuber wrote:
             | lisp's syntax is simple - it's just parentheses to define a
             | list, and the first element of a list is executed as a
             | function.
             | 
             | but for example a language like C has many different
             | syntaxes for different operations, like function
             | declaration or variable or array syntax, or if/switch-case
             | etc etc.
             | 
             | so to know C syntax you need to learn all these different
             | ways to do different things, but in lisp you just need to
             | know how to match parentheses.
             | 
             | But of course you still want to declare variables, or have
             | if/else and switch case. So you instead need to learn the
             | builtin macros (what GP means by semantics) and their
             | "syntax" that is technically not part of the language's
             | syntax but actually is since you still need all those
             | operations enough that they are included in the standard
             | library and defining your own is frowned upon.
        
             | skydhash wrote:
             | Most languages' abstract machines expose a very simple API,
             | it's up to the language to add useful constructs to help us
             | write code more efficiently. Languages like Lisp start with
             | a very simple syntax, then add those constructs with the
             | language itself (even though those can be fixed using a
             | standard), others just add it through the syntax. These
             | constructs plus the abstract machine's operations form the
             | semantics, syntax is however the language designer decided
             | to present them.
        
           | __MatrixMan__ wrote:
           | Jury's out re: whether I feel this in my gut. Need more time
           | with the lisps for that. But re: cognitive load maybe it goes
           | like:
           | 
           | 1. 1 language to rule them all, fancy syntax
           | 
           | 2. Many languages, 1 simple syntax to rule them all
           | 
           | 3. Many languages and many fancy syntaxes
           | 
           | Here in the wreckage of the tower of babel, 1. isn't really
           | on the table. But 2. might have benefits because the
           | inhumanity of the syntax need only be confronted once. The
           | cumulative cost of all the competing opinionated fancy
           | syntaxes may be the worst option. Think of all the hours lost
           | to tabs vs spaces or braces vs whitespace.
        
             | dartos wrote:
             | I think 3 is not only a natural state, but the best state.
             | 
             | I don't think we can have 1 language that satisfies the
             | needs of all people who write code, and thus, we can't have
             | 1 syntax that does that either.
             | 
             | 3 seems the only sensible solution to me, and we have it.
        
               | __MatrixMan__ wrote:
               | I dunno, here in 3 the hardest part of learning a
               | language has little to do with the language itself and
               | more to do with the ecosystem of tooling around that
               | language. I think we could more easily get on to the
               | business of using the right language for the job if more
               | of that tooling was shared. If each language, for
               | instance, did not have its own package manager, its own
               | IDE, its own linters and language servers all with their
               | own idiosyncrasies arising not from deep philosophical
               | differences of the associated language but instead from
               | accidental quirks of perspective from whoever decided
               | that their favorite language needed a new widget.
               | 
               | I admire the widget makers, especially those wrangling
               | the gaps between languages. I just wish their work could
               | be made easier.
        
               | skydhash wrote:
               | I really like the Linux package managers. If you're going
               | to write an application that will run on some system,
               | it's better to bake dependencies into it. And with
               | virtualization and containerization, the system is not
               | tied to a physical machine. I've been using containers
               | (incus) more and more for real development purposes as I
               | can use almost the same environment to deploy. I don't
               | care much about the IDE, but I'm glad we have LSP,
               | Tree-sitter, and DAP. The one thing I do not like is the
               | proliferation of tooling version managers (NVM, ...)
               | instead of managing the environment itself (tied to the
               | project).
        
         | nlitened wrote:
         | I am surprised to hear that structural editing has been a
         | hurdle for you, and I think I can offer a piece of advice. I
         | also used to be terrified by its apparent complexity, but later
         | found out that one just needs to use parinfer and to know key
         | bindings for only three commands: slurp, barf, and raise.
         | 
         | With just these four things you will be 95% there, enjoying the
         | fruits of paredit without any complexity -- all the remaining
         | tricks you can learn later when you feel like you're fluent.
        
           | __MatrixMan__ wrote:
           | Thanks very much for the advice, it's timely.
           | 
           | <rant> It's not so much the editing itself but the
           | unfamiliarity of the ecosystem. It seems it's a square peg,
           | and I've been crafting a round hole of habits for it:
           | 
           | I guess I should use emacs? How to even configure it such
           | that these actions are available? Or maybe I should write a
           | plugin for helix so that I can be in a familiar environment.
           | Oh, but the helix plugin language is a scheme, so I guess
           | I'll use emacs until I can learn scheme better and then write
           | that plugin. Oh but emacs keybinds are conflicting with what
           | I've configured for zellij, maybe I can avoid conflicts by
           | using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex
           | seems like it aligns with my modal brain, oh but there goes
           | another afternoon of fussing with emacs. Found and reported a
           | symex "bug" but apparently it only appears in nix-governed
           | environments so I guess I gotta figure out how to report the
           | packaging bug (still todo). Also, I guess I might as well
           | figure out how to get emacs to evaluate expressions based on
           | which ones are selected, since that's one of the fun things
           | you can do in lisps, but there's no plugin for the scheme
           | that helix is using for its plugin language (which is why I'm
           | learning scheme in the first place), but it turns out that AI
           | is weirdly good at configuring emacs so now my emacs config
           | contains most of what that plugin would entail. Ok, now I'm
           | finally ready to learn scheme, I've got this big list of new
           | actions to learn:
           | https://countvajhula.com/2021/09/25/the-animated-guide-to-sy....
           | Slurp, barf, and raise you say? Excellent, I'll focus on those.
           | 
           | I'm not actually trying to critique the unfamiliar space.
           | These are all self inflicted wounds: me being persnickety
           | about having it my way. It's just usually not so difficult to
           | use something new and also have it my way.</rant>
        
             | xenophonf wrote:
             | I never bothered with structural editing on Emacs. I just
             | use the sentence/paragraph movement commands. M-a, M-e,
             | M-n, M-p, M-T, M-space, etc.
        
             | nlitened wrote:
             | To be fair, I am not a "lisper" and I don't know Emacs at
             | all. I am just a Clojure enjoyer who uses IntelliJ +
             | Cursive with its built-in parinfer/paredit.
        
             | pxc wrote:
             | > Oh but emacs keybinds are conflicting with what I've
             | configured for zellij,
             | 
             | Don't do that. ;)
             | 
             | Emacs is a graphical application! Don't use it in the
             | terminal unless you really have to (i.e., you're using it
             | on a remote machine and TRAMP will not do).
             | 
             | > it turns out that AI is weirdly good at configuring emacs
             | 
             | I was just chatting with a friend about this. ChatGPT seems
             | to be much better at writing ELisp than many other
             | languages I've asked it to work with.
             | 
             | Also while you're playing with it, you might be interested
             | in checking out kakoune.el or meow, which provide modal
             | editing in Emacs but with the selection-first ordering for
             | commands, like in Kakoune and Helix rather than the old vi
             | way.
             | 
             | PS: symex looks really interesting! Hadn't seen that one.
        
         | fanf2 wrote:
         | Lisp has reader macros which allow you to reprogram its lexer.
         | Lisp macros allow you to program the translation from the
         | visible structure to the parse tree.
         | 
         | For example, https://pyret.org/
         | 
         | It really isn't simple or necessarily uniform.
        
           | __MatrixMan__ wrote:
           | I've heard that certain lisps (Common Lisp comes up when I
           | search for reader macros) allow for all kinds of tinkering
           | with themselves. But the ability of one to make itself not a
           | lisp anymore, while interesting, doesn't seem to say much
           | about the merits of sticking to s-expressions, except maybe
           | to point out that somebody once decided not to.
        
       | kazinator wrote:
       | I don't think it's easy to write a good syntax coloring engine
       | like the one in Vim.
       | 
       | Syntax coloring has to handle context: different rules for
       | material nested in certain ways.
       | 
       | Vim's syntax highlighter lets you declare two kinds of items:
       | matches and regions. Matches are simpler lexical rules, whereas
       | regions have separate expressions for matching the start and end
       | and middle. There are ways to exclude leading and trailing
       | material from a region.
       | 
       | Matches and regions can declare that they are contained. In that
       | case they are not active unless they occur in a containing
       | region.
       | 
       | Contained matches declare which regions contain them.
       | 
       | Regions declare which other regions they contain.
       | 
       | That's the basic semantic architecture; there are bells and
       | whistles in the system due to situations that arise.
       | 
       | I don't think even Justine could develop that in an interview,
       | other than as an overnight take home.
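       | 
       | To make the containment rules above concrete, here is a toy
       | model in Python. It is not Vim's algorithm, only a sketch under
       | heavy simplifying assumptions (no "middle"/skip patterns, first
       | rule wins, one character at a time); the Match and Region
       | classes and the highlight function are invented names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    name: str
    pattern: str
    containedin: tuple = ()   # region names; empty means top level only


@dataclass(frozen=True)
class Region:
    name: str
    start: str
    end: str
    contains: tuple = ()      # names of regions that may open inside
    contained: bool = False   # if True, never opens at top level


def highlight(text, matches, regions):
    """Return (lexeme, group) pairs; group is None for plain top-level text."""
    out, stack, i = [], [], 0
    while i < len(text):
        inner = stack[-1] if stack else None
        step = None  # (regex match, group name, action)
        if inner is not None:  # the innermost region may end here
            m = re.compile(inner.end).match(text, i)
            if m:
                step = (m, inner.name, "close")
        if step is None:  # otherwise an allowed region may start
            for r in regions:
                ok = r.name in inner.contains if inner is not None else not r.contained
                m = re.compile(r.start).match(text, i) if ok else None
                if m:
                    step = (m, r.name, r)
                    break
        if step is None:  # otherwise an allowed simple match may apply
            for mt in matches:
                ok = inner.name in mt.containedin if inner is not None else not mt.containedin
                m = re.compile(mt.pattern).match(text, i) if ok else None
                if m:
                    step = (m, mt.name, None)
                    break
        if step is None or step[0].end() == i:
            # Nothing matched: the character takes the innermost region's color.
            out.append((text[i], inner.name if inner is not None else None))
            i += 1
            continue
        m, group, action = step
        out.append((m.group(), group))
        i = m.end()
        if action == "close":
            stack.pop()
        elif action is not None:
            stack.append(action)
    return out


regions = [
    Region("string", r'"', r'"', contains=("interp",)),
    Region("interp", r"\{", r"\}", contains=("string",), contained=True),
]
matches = [
    Match("keyword", r"if\b"),
    Match("ident", r"[A-Za-z_]\w*", containedin=("interp",)),
]
result = highlight('if "a {x} b"', matches, regions)
print(result)
```

       | In this sketch "if" is a keyword only at top level, and "x" is
       | an identifier only inside the {...} region nested in the string.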
        
         | kazinator wrote:
         | Here is an example of something hard to handle: TXR language
         | with embedded TXR Lisp.
         | 
         | This is the "genman" script which takes the raw output of a
         | manpage to HTML converter, and massages it to form the HTML
         | version of the TXR manual:
         | 
         | https://www.kylheku.com/cgit/txr/tree/genman.txr
         | 
         | Everything that is white (not colored) is literal template
         | material. Lisp code is embedded in directives, like @(do ...).
         | In this scheme, TXR keywords appear purple, TXR Lisp ones
         | green. They can be the same; see the (and ...) in line 149,
         | versus numerous occurrences of @(and).
         | 
         | Quasistrings contain nested syntax: see 130 where `<a href ..>
         | ... </a>` contains an embedded (if ...). That could itself
         | contain a quasistring with more embedded code.
         | 
         | TXR's _txr.vim_ and _tl.vim_ syntax definition files are both
         | generated by this:
         | 
         | https://www.kylheku.com/cgit/txr/tree/genvim.txr
        
         | saghm wrote:
         | Naively, I would have assumed that the "correct" way to write a
         | syntax highlighter would be to parse into an AST and then
         | iterate over the tokens and update the color of a token based
         | on the type of node (and maybe just tracking a diff to avoid
         | needing to recolor things that haven't changed). I'm guessing
         | that if this isn't done, it's for efficiency reasons (e.g. due
         | to requiring parsing the whole file to highlight rather than
         | just the part currently visible on the screen)?
        
           | Someone wrote:
           | > I would have assumed that the "correct" way to write a
           | syntax highlighter would be to parse into an AST and then
           | [...] I'm guessing that if this isn't done, it's for
           | efficiency reasons
           | 
           | It's not only running time, but also ease of implementation.
           | 
           | A good syntax highlighter should do a decent job highlighting
           | both valid and invalid programs (rationale: in most (editor,
           | language) pairs, writing a program involves going through
           | moments where the program being written isn't a valid
           | program)
           | 
           | If you decide to use an AST, that means you need to have good
           | heuristics for turning invalid programs into valid ones that
           | best mimic what the programmer intended. That can be
           | difficult to achieve (good compilers have such heuristics,
           | but even if you have such a compiler, chances are it isn't
           | possible to reuse them for syntax coloring)
           | 
           | If this simpler approach gives you most of what you can get
           | with the AST approach, why bother writing that?
           | 
           | Also, there are languages where some programs can't be
           | perfectly parsed or syntax colored without running them. For
           | those, you need this approach.
        
       | susam wrote:
       | > Every C programmers (sic) knows you can't embed a multi-line
       | comment in a multi-line comment.
       | 
       | And every Standard ML programmer might find this to be a
       | surprising limitation. The following is a valid Standard ML
       | program:
       | 
       | (* (* Nested (**) *) comment *)
       | val _ = print "hello, world\n"
       | 
       | Here is the output:
       | 
       | $ sml < hello.sml
       | Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
       | - = hello, world
       | 
       | $ mlton hello.sml && ./hello
       | hello, world
       | 
       | Given how C was considered one of the "expressive" languages when
       | it arrived, it's curious that nested comments were never part of
       | the language.
        
         | dahart wrote:
         | There are 3 things I find funny about that comment: ML didn't
         | have single-line comments, so same level of surprising
         | limitation. I've never heard someone refer to C as
         | "expressive", but maybe it was in 1972 when compared to
         | assembly. And what bearing does the comment syntax have on the
         | expressiveness of a language? I would argue absolutely none at
         | all, by _definition_. :P
        
           | susam wrote:
           | > ML didn't have single-line comments, so same level of
           | surprising limitation.
           | 
           | It is not quite clear to me why the lack of single-line
           | comments is such a surprising limitation. After all, a
           | single-line block comment can easily serve as a substitute.
           | However, there is no straightforward workaround for the lack
           | of nested block comments.
           | 
           | > I've never heard someone refer to C as "expressive", but
           | maybe it was in 1972 when compared to assembly.
           | 
           | I was thinking of Fortran in this context. For instance,
           | Fortran 77 lacked function pointers and offered a limited set
           | of control flow structures, along with cumbersome support for
           | recursion. I know Fortran, with its native support for
           | multidimensional arrays, excelled in numerical and scientific
           | computing but C quickly became the preferred language for
           | general purpose computing.
           | 
           | While very few today would consider C a pinnacle of
           | expressiveness, when I was learning C, the landscape of
           | mainstream programming languages was much more restricted. In
           | fact, the preface to the first edition of K&R notes the
           | following:
           | 
           | _"In our experience, C has proven to be a pleasant,
           | expressive and versatile language for a wide variety of
           | programs."_
           | 
           | C, Pascal, etc. stood out as some of the few mainstream
           | programming languages that offered a reasonable level of
           | expressiveness. Of course, Lisp was exceptionally expressive
           | in its own right, but it wasn't always the best fit for
           | certain applications or environments.
           | 
           | > And what bearing does the comment syntax have on the
           | expressiveness of a language?
           | 
           | Nothing at all. I agree. The expressiveness of C comes from
           | its grammar, which the language parser handles. Support for
           | nested comments, in the context of C, is a concern for the
           | lexer, so indeed one does not directly influence the other.
           | However, it is still curious that a language with such a
           | sophisticated grammar and parser could not allocate a bit of
           | its complexity budget to support nested comments in its
           | lexer. This is a trivial matter, I know, but I still couldn't
           | help but wonder about it.
        
             | dahart wrote:
             | Fair enough. From my perspective, lack of single line
             | comments is a little surprising because most other
             | languages had it at the time (1973, when ML was
             | introduced). Lack of nested comments doesn't seem
             | surprising, because it isn't an important feature for a
             | language, and because most other languages did not have it
             | at the time (1972, when C was introduced).
             | 
             | I can imagine both pro and con arguments for supporting
             | nested comments, but regardless of what I think, C
             | certainly could have added support for nested comments at
             | any time, and hasn't, which suggests that there isn't
             | sufficient need for it. That might be the entire
             | explanation: not even worth a little complexity.
        
               | masfuerte wrote:
               | AFAIK, C didn't get single line comments until C99. They
               | were a C++ feature originally.
        
               | dahart wrote:
               | Oh wow, I didn't remember that, and I did start writing C
               | before 99. I stand corrected. I guess that is a little
               | surprising. ;)
               | 
               | Is it true that many languages had single line comments?
               | Maybe I'm forgetting more, but I remember everything else
               | having single line comments... asm, basic, shell. I used
               | Pascal in the 80s and apparently forgot it didn't have
               | line comments either?
        
               | masfuerte wrote:
               | That's my recollection, that most languages had single
               | line comments. Some had multi-line comments but C++ is
               | the first I remember having syntaxes for both. That said,
               | I'm not terribly familiar with pre-80s stuff.
        
               | susam wrote:
               | > C certainly could have added support for nested
               | comments at any time
               | 
               | After C89 was ratified, adding nested comments to C would
               | have risked breaking existing code. For instance, this is
               | a valid program in C89:
               | 
               | #include <stdio.h>
               | 
               | int main() {
               |     /* /* Comment */
               |     printf("hello */ world");
               |     return 0;
               | }
               | 
               | However, if a later C standard were to introduce nested
               | comments, it would break the above program because then
               | the following part of the program would be recognised as
               | a comment:
               | 
               | /* /* Comment */
               | printf("hello */
               | 
               | The above text would be ignored. Then the compiler would
               | encounter the following:
               | 
               | world");
               | 
               | This would lead to errors like _undeclared identifier
               | 'world'_, _missing terminating " character_, etc.
        
             | pklausler wrote:
             | > Fortran 77 lacked function pointers
             | 
             | But we did have dummy procedures, which covered one of the
             | important use cases directly, and which could be abused to
             | fake function/subroutine pointers stored in data.
        
         | gsliepen wrote:
         | Well there is one way to nest comments in C, and that's by
         | using #if 0:
         | 
         | #if 0
         | This is a
         | #if 0
         | nested comment!
         | #endif
         | #endif
        
           | fanf2 wrote:
           | Except that text inside #if 0 still has to lex correctly.
           | 
           | (unifdef has some evil code to support using C-style
           | preprocessor directives with non-C source, which mostly boils
           | down to ignoring comments. I don't recommend it!)
        
             | dahart wrote:
             | > Except that text inside #if 0 still has to lex correctly.
             | 
             | Are you sure? I just tried on godbolt and that's not true
             | with gcc 14.2. I've definitely put syntax errors
             | intentionally into #if 0 blocks and had it compile. Are you
             | thinking of some older version or something? I thought the
             | pre-processor ran before the lexer since always...
        
               | fanf2 wrote:
               | There are three (relevant) phases (see "translation
               | phases" in section 5 of the standard):
               | 
               | * program is lexed into preprocessing tokens; comments
               | turn into whitespace
               | 
               | * preprocessor does its thing
               | 
               | * preprocessor tokens are turned into proper tokens;
               | different kinds of number are disambiguated; keywords and
               | identifiers are disambiguated
               | 
               | If you put an unclosed comment inside #if 0 then it won't
               | work as you might expect.
        
               | dahart wrote:
               | Ah, I see. You're right!
        
         | kragen wrote:
         | This is not just true of Standard ML; it's also true of regular
         | ML.
        
         | layer8 wrote:
         | Lexing nested comments requires maintaining a stack (or at
         | least a nesting-level counter). That wasn't traditionally seen
         | as being within the realm of lexical analysis, which would only
         | use a finite-state automaton, like regular expressions.
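         | 
         | A minimal sketch of the nesting-level counter, in Python. The
         | function name strip_comments and its parameters are made up,
         | and it deliberately ignores string literals, so it is not a
         | real C or ML lexer. With nested=False the counter never
         | exceeds 1, which is the finite-state behaviour.

```python
def strip_comments(src, open_="/*", close="*/", nested=True):
    """Remove block comments; `depth` is the nesting-level counter."""
    out = []
    depth = 0
    i = 0
    while i < len(src):
        if src.startswith(open_, i) and (nested or depth == 0):
            depth += 1              # entering a (possibly nested) comment
            i += len(open_)
        elif depth > 0 and src.startswith(close, i):
            depth -= 1              # leaving one level of comment
            i += len(close)
        else:
            if depth == 0:
                out.append(src[i])  # ordinary text outside any comment
            i += 1
    if depth:
        raise SyntaxError("unterminated comment")
    return "".join(out)


line = '/* /* Comment */ printf("hello */ world");'
print(strip_comments(line, nested=False))  # C89 rule: first */ ends the comment
print(strip_comments(line, nested=True))   # nested rule: only ' world");' survives
print(strip_comments("(* (* Nested (**) *) comment *)ok", "(*", "*)"))
```

         | Run on the C89 program upthread, the two modes disagree about
         | where the comment ends, which is the breakage described there.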
        
       | lupire wrote:
       | > You'll notice its hash function only needs to consider a single
       | character in in a string. That's what makes it perfect,
       | 
       | Is that a joke?
       | 
       | https://en.m.wikipedia.org/wiki/Perfect_hash_function
        
       | playingalong wrote:
       | Nice read.
       | 
       | I guess the article could be called Falsehoods Programmers Assume
       | of Programming Language Syntaxes.
        
       | TomatoCo wrote:
       | I think my favorite C trigraph was something like
       | do_action() ??!??! handle_error()
       | 
       | It almost looks like special error handling syntax but still
       | remains satisfying once you realize it's an || logical-or
       | expression and it's using short-circuit rules to execute
       | handle_error() if the action returns zero (i.e. false).
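For readers who have not met trigraphs: a minimal sketch of the replacement step, using the nine trigraphs defined by the older C standards. The function name `replace_trigraphs` is hypothetical. Trigraphs were removed in C23 and C++17, so this describes legacy behavior.

```python
# The nine trigraphs: "??" followed by one of these characters.
TRIGRAPHS = {
    "=": "#", "(": "[", "/": "\\", ")": "]", "'": "^",
    "<": "{", "!": "|", ">": "}", "-": "~",
}


def replace_trigraphs(src: str) -> str:
    """Replace every trigraph sequence with the character it stands for."""
    out, i = [], 0
    while i < len(src):
        if src.startswith("??", i) and i + 2 < len(src) and src[i + 2] in TRIGRAPHS:
            out.append(TRIGRAPHS[src[i + 2]])
            i += 3
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


print(replace_trigraphs("do_action() ??!??! handle_error()"))
# prints: do_action() || handle_error()
```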
        
         | wslh wrote:
         | Did you choose the legacy C trigraphs over || for aesthetic
         | purposes?
        
           | wslh wrote:
           | Could you review my comment on HN? Please educate me if there
           | is something I haven't understood, rather than downvoting my
           | question.
        
             | samatman wrote:
             | The grandparent post is specifically about trigraphs.
             | Saying something about trigraphs was the end-in-itself,
             | trigraphs were chosen to illustrate something about
             | trigraphs. So your question made no sense. Hope that helps.
        
       | IshKebab wrote:
       | I don't understand why you wouldn't use Tree Sitter's syntax
       | highlighting for this. I mean it's not going to be as fast but
       | that clearly isn't an issue here.
       | 
       | Is this a "no third party dependencies" thing?
        
         | jart wrote:
         | I don't want to require everyone who builds llamafile from
         | source to install Rust. I don't even require that people
         | install the gperf command, since I can build gperf as a 700kb
         | actually portable executable and vendor it in the repo. Tree
         | sitter I'd imagine does a really great highly precise job with
         | the languages it supports. However it appears to support fewer
         | of them than I am currently. I'm taking a breadth first
         | approach to syntax highlighting, due to the enormity of
         | languages LLMs understand.
        
           | IshKebab wrote:
           | I think the Rust component of tree-sitter-highlight is
           | actually pretty small (Tree Sitter generates C for the actual
           | parser).
           | 
           | But fair enough - fewer dependencies is always nice,
           | especially in C++ (which doesn't have a modern package
           | manager) and in ML where an enormous janky Python
           | installation is apparently a perfectly normal thing to
           | require.
        
             | mdaniel wrote:
             | I somehow thought Conan[1] was the C++ package manager;
             | it's at least partially supported by GitLab, for what
             | that's worth
             | 
             | 1: https://docs.conan.io/2/introduction.html
        
               | IshKebab wrote:
               | No, if anything vcpkg is "the C++ package manager", but
               | it's nowhere near pervasive and easy-to-use enough to
               | come close to even Pip. It's leagues away from Cargo, Go,
               | and other _actually good_ PL package managers.
        
       | jim_lawless wrote:
       | Forth has a default syntax, but Forth code can execute during the
       | compilation process allowing it to accept/compile custom
       | syntaxes.
        
       | SonOfLilit wrote:
       | Justine gets very close to the hairiest parsing issue in any
       | language without encountering it:
       | 
       | Perl's syntax is undecidable, because the difference between
       | treating some characters as a comment or as a regex can depend on
       | the type of a variable that is only determined e.g. based on
       | whether a search for a Collatz counterexample terminates, or
       | just, you know, user input.
       | 
       | https://perlmonks.org/?node_id=663393
       | 
       | C++ templates have a similar issue, I think.
        
         | fanf2 wrote:
         | I think possibly the most hilariously complicated instance of
         | this is in perl's tokenizer, toke.c (which starts with a
         | Tolkien quote, 'It all comes from here, the stench and the
         | peril.' -- Frodo).
         | 
         | There's a function called intuit_more which works out if
         | $var[stuff] inside a regex is a variable interpolation followed
         | by a character class, or an array element interpolation. Its
         | result can depend on whether something in the stuff has been
         | declared as a variable or not.
         | 
         | But even if you ignore the undecidability, the rest is still
         | ridiculously complicated.
         | 
         | https://github.com/Perl/perl5/blob/blead/toke.c#L4502
        
         | swolchok wrote:
         | > C++ templates have a similar issue
         | 
         | TIL! I went and dug up a citation:
         | https://blog.reverberate.org/2013/08/parsing-c-is-literally-...
        
         | layer8 wrote:
         | How could a search for a Collatz counterexample possibly
         | terminate? ;)
        
       | petesergeant wrote:
       | > Perl also has this goofy convention for writing man pages in
       | your source code
       | 
       | The world corpus of software would be much better documented if
       | everywhere else had stolen this from Perl. Inline POD is great.
        
         | kragen wrote:
         | Perl and Python stole it from Emacs Lisp, though Perl took it
         | further. I'm not sure where Java stole it from, but nowadays
         | Doxygen is pretty common for C code. Unfortunately this results
         | in people thinking that Javadoc and Doxygen are substitutes for
         | actual documentation like the Emacs Lisp Reference Manual,
         | which cannot be generated from docstrings, because the
         | organization of the source code is hopelessly inadequate for a
         | reference manual.
        
           | mdaniel wrote:
           | > Emacs Lisp Reference Manual, which cannot be generated from
           | docstrings, because the organization of the source code is
           | hopelessly inadequate for a reference manual.
           | 
           | Well, they're not doing themselves any favors by just
           | willy-nilly mixing C with "user-facing" defuns
           | <https://emba.gnu.org/emacs/emacs/-/blob/ed1d691184df4b50da6b...>.
           | I was curious if they could benefit from "literate
           | programming" since OrgMode is _the bee's knees_, but not with
           | that style of coding they can't.
        
       | metadat wrote:
       | _> The languages I decided to support are Ada, Assembly, BASIC,
       | C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML,
       | Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make,
       | Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust,
       | Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig._
       | 
       | A few (admittedly silly) questions about the list:
       | 
       | 1. Why no Erlang, Elixir, or Crystal?
       | 
       | Erlang appears to be just at the author's boundary at #47 on the
       | TIOBE index. https://www.tiobe.com/tiobe-index/
       | 
       | 2. What is _"Shell"_? Sh, Bash, Zsh, Windows Cmd, PowerShell..?
       | 
       | 3. Perl but no Awk? Curious why, because Awk is a similar but
       | comparatively trivial language. Widely used, too.
       | 
       | To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet
       | m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.
       | 
       | What's the methodology / criteria? Only things the author is
       | already familiar with?
       | 
       | Still a fun article.
        
         | Yasuraka wrote:
         | TIOBE's index is quite literally worthless, especially with
         | regard to its stated purpose, let alone as a general point of
         | orientation.
         | 
         | I wish that people would stop lending it any credibility.
        
       | dakiol wrote:
       | Wouldn't it be possible to let the LLM do the highlighting?
       | Instead of returning code in plain text, it could return code
       | within HTML with the appropriate tags. Maybe it's harder than it
       | sounds...
       | but if it's just for highlighting the code the LLM returns, I
       | wouldn't mind the highlighting not being 100% accurate.
        
         | trashburger wrote:
         | Would be much slower and eat up precious context window.
        
       | layer8 wrote:
       | The author may have missed that lexing C is actually context-
       | sensitive, i.e. you need a symbol table:
       | https://en.wikipedia.org/wiki/Lexer_hack
       | 
       | Of course, for syntax highlighting this is only relevant if you
       | want to highlight the multiplication operator differently from
       | the dereferencing operator, or declarations differently from
       | expressions.
       | 
       | More generally, however, I find it useful to highlight (say)
       | types differently from variables or functions, which in some
       | (most?) popular languages requires full parsing and symbol table
       | information. Some IDEs therefore implement two levels of syntax
       | highlighting, a basic one that only requires lexical information,
       | and an extended one that kicks in when full grammar and type
       | information becomes available.
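A toy illustration of the symbol-table dependence described above, assuming a statement of the form `a * b;`. The helper `classify` and its `typedef_names` parameter are hypothetical and cover only this one shape; a real C front end feeds typedef names from the parser back to the lexer.

```python
def classify(statement: str, typedef_names: set) -> str:
    """Decide whether 'a * b;' declares a pointer or multiplies two values.

    The answer depends on whether the first identifier currently names a
    type, which is exactly the information a pure lexer does not have.
    """
    tokens = statement.replace("*", " * ").replace(";", " ; ").split()
    if len(tokens) >= 3 and tokens[1] == "*":
        return "declaration" if tokens[0] in typedef_names else "expression"
    return "unknown"


print(classify("a * b;", {"a"}))   # 'a' is a typedef name: pointer declaration
print(classify("a * b;", set()))   # 'a' is a variable: multiplication
```

A highlighter working from lexical information alone has to pick one reading for both cases, which is why the second, symbol-aware highlighting pass exists in some IDEs.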
        
         | legobmw99 wrote:
         | I'd be shocked if jart didn't know this, but it seems unlikely
         | that an LLM would generate one of these most vexing parses,
         | unless explicitly asked
        
           | layer8 wrote:
           | Given all the things that were new to the author in the
           | article, I wouldn't be shocked at all. There's just a huge
           | number of things to know, or to have come across.
        
           | quietbritishjim wrote:
           | I think you're thinking of something different to the issue
           | in the parent comment. The most vexing parse is, as the name
           | suggests, a problem at the parsing stage rather than the
           | earlier lexing phase. Unlike the referenced lexing problem,
           | it doesn't require any hack for compilers to deal with it.
           | That's because it's not really a problem for the compiler;
           | it's humans that find it surprising.
        
       | murkt wrote:
       | Author hasn't tried to highlight TeX. Which is good for their
       | mental health, I suppose, as it's generally impossible to fully
       | highlight TeX without interpreting it.
       | 
       | Even parsing is not enough, as it's possible to redefine what
       | each character does. You can make it do things like "and now K
       | means { and C means }".
       | 
       | Yes, you can find papers on arXiv that use this god-forsaken
       | feature.
        
         | jart wrote:
         | I wrote
         | https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil...
         | and it does a reasonable
         | job highlighting without breaking for all the .tex files I
         | could find on my hard drive. My goal is to hopefully cover
         | 99.9% of real world usage, since that'll likely cover
         | everything an LLM might output. Esoteric syntax also usually
         | isn't a problem, so long as it doesn't cause strings and
         | comments to extend forever, eclipsing the rest of the source
         | code in a file.
        
         | nathell wrote:
         | Same with Common Lisp (you can redefine the read table),
         | although that's likely abused less often on arXiv.
        
         | bobbylarrybobby wrote:
         | I couldn't believe it when I learned that \makeatletter does
         | not "make (something) at a letter (character)" but rather
         | "treats the '@' character as a letter when parsing".
        
       | xonix wrote:
       | No AWK?
        
       | sundarurfriend wrote:
       | The final line number count is missing Julia. Based on the file
       | in the repo, it would be at the bottom of the first column:
       | between ld and R.
       | 
       | Among the niceties listed here, the one I'd wish for Julia to
       | have would be C#'s "However many quotes you put on the lefthand
       | side, that's what'll be used to terminate the string at the other
       | end". Documentation that talks about quoting would be so much
       | easier to read (in source form) with something like that.
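The quote-counting rule quoted above can be sketched like this. `scan_raw_string` is a hypothetical helper. It assumes a minimum of three opening quotes, as in C# raw string literals, and treats a run longer than the opener as an error, which is a simplifying assumption. It also ignores the indentation rules for multi-line literals.

```python
def scan_raw_string(src: str, start: int = 0):
    """Scan a raw string that opens with N >= 3 double quotes at `start`.

    Returns (body, index just past the closing quotes). Runs of fewer than
    N quotes inside the literal are ordinary content.
    """
    n = 0
    while start + n < len(src) and src[start + n] == '"':
        n += 1
    if n < 3:
        raise ValueError("raw string needs at least three opening quotes")
    body_start = i = start + n
    while i < len(src):
        if src[i] != '"':
            i += 1
            continue
        run = 0
        while i + run < len(src) and src[i + run] == '"':
            run += 1
        if run == n:
            return src[body_start:i], i + run
        if run > n:
            raise ValueError("too many closing quotes")  # assumption of this sketch
        i += run  # shorter run: part of the string body
    raise ValueError("unterminated raw string")


text = '""""say """hi""" now"""" rest'
body, end = scan_raw_string(text)
print(body)
print(repr(text[end:]))
```

Here the literal opens with four quotes, so the triple quotes around `hi` need no escaping.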
        
       | nusaru wrote:
       | > Ruby is the union of all earlier languages, and it's not even
       | formally documented.
       | 
       | It's documented, but you need $250 to spare:
       | https://www.iso.org/standard/59579.html
        
         | mdaniel wrote:
         | Well, according to (ahem) _a copy_ that I found, it only goes
         | up to MRI 1.9 and goes out of its way to say "welp, the world
         | is changing, so we're just going to punt until Ruby stabilizes"
         | which is damn cheating for a _standard_ IMHO
         | 
         | Also, while doing some digging I found there actually are a
         | number of the standards that are legitimately publicly
         | available
         | https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
        
       ___________________________________________________________________
       (page generated 2024-11-02 23:00 UTC)