[HN Gopher] Weird Lexical Syntax
___________________________________________________________________
Weird Lexical Syntax
Author : jart
Score : 410 points
Date : 2024-11-02 07:45 UTC (1 day ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| llm_trw wrote:
| I've done a fair bit of forth and I've not seen c" used. The
| usual string printing operator is ." .
| mananaysiempre wrote:
| Counted ("Pascal") strings are rare nowadays so C" is not often
| used. Its addr len equivalent is S" and that one is fairly
| common in string manipulation code.
| kragen wrote:
| Right, _c"_ is for when you want to pass a literal string to
| some other word, not print it. But I agree that it's not very
| common, because you normally use _s"_ for that, which leaves
| the address and length on the stack, while _c"_ leaves just an
| address on the stack, pointing to a one-byte count field
| followed by the bytes. I think adding _c"_ in Forth-83 (and
| renaming _"_ to _s"_) was a mistake, and it would have been
| better to deprecate the standard words that expect or produce
| such counted strings, other than _count_ itself. See
| https://forth-standard.org/standard/alpha,
| https://forth-standard.org/standard/core/Cq,
| https://forth-standard.org/standard/core/COUNT, and
| https://forth-standard.org/standard/core/Sq.
|
| You can easily add new string and comment syntaxes to Forth,
| though. For example, you can add BCPL-style // comments to end
| of line with this line of code in, I believe, all standard
| Forths, though I've only tested it in GForth:
| : // 10 word drop ; immediate
|
| Getting it to work in block files requires more work but is
| still only a few lines of code. The standard word _\_ does
| this, and _see \_ decompiles the GForth implementation as
| : \ blk @ IF >in @ c/l / 1+ c/l * >in !
| EXIT THEN source >in ! drop ; immediate
|
| This kind of thing was commonly done for text editor commands,
| for example; you might define _i_ as a word that reads text
| until the end of the line and inserts it at the current
| position in the editor, rather than discarding it like my //
| above. Among other things, the screen editor in F83 does
| exactly that.
|
| So, as with Perl, PostScript, TeX, m4, and Lisps that support
| readmacros, you can't lex Forth without executing it.
| skrebbel wrote:
| This was a delightful read, thanks!
| croisillon wrote:
| Glad to see confirmed that PHP is the most non-weird programming
| language ;)
| rererereferred wrote:
| I recently learned php's heredoc can have space before it and
| it will remove those spaces from the lines in the string:
|     $a = <<<EOL
|         This is not indented
|             but this has 4 spaces of indentation
|         EOL;
|
| But the spaces have to match: if any line has fewer leading
| spaces than the closing EOL, it gives an error.
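For comparison, Python's `textwrap.dedent` does something similar, stripping the longest common leading whitespace from every line. It's only a rough analogue (it computes the margin itself and never errors, unlike PHP's closing-marker rule), but it illustrates the same idea:

```python
import textwrap

# dedent strips the common indentation, much like PHP strips the
# closing EOL marker's indentation from every line of the heredoc.
s = textwrap.dedent("""\
    This is not indented
        but this has 4 spaces of indentation
""")
print(s)
```

The result keeps the relative indentation: the first line ends up flush left and the second keeps its extra 4 spaces.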
| alganet wrote:
| There are two types of languages: the ones full of quirks and
| the ones no one uses.
| skitter wrote:
| Another syntax oddity (not mentioned here) that breaks most
| highlighters: In Java, unicode escapes can be anywhere, not just
| in strings. For example, the following is a valid class:
| class Foo\u007b}
|
| and this assert will not trigger:
|     assert
|         // String literals can have unicode escapes like \u000A!
|         "Hello World".equals("\u00E4");
| ivanjermakov wrote:
| I have never seen this in Java! Is there any use cases where it
| could be useful?
| susam wrote:
| I don't know about usefulness but it does let us write
| identifiers using Unicode characters. For example:
|     public class Foo {
|         public static void main(String[] args) {
|             double \u03c0 = 3.14159265;
|             System.out.println("\u03c0 = " + \u03c0);
|         }
|     }
|
| Output:
|     $ javac Foo.java && java Foo
|     π = 3.14159265
|
| Of course, nowadays we can simply write this with any decent
| editor:
|     public class Foo {
|         public static void main(String[] args) {
|             double π = 3.14159265;
|             System.out.println("π = " + π);
|         }
|     }
|
| Support for Unicode escape sequences is a result of how the
| Java Language Specification (JLS) defines InputCharacter.
| Quoting from Section 3.4 of JLS
| <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:
|     InputCharacter:
|         UnicodeInputCharacter but not CR or LF
|
| UnicodeInputCharacter is defined as the following in section
| 3.3:
|     UnicodeInputCharacter:
|         UnicodeEscape
|         RawInputCharacter
|     UnicodeEscape:
|         \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
|     UnicodeMarker:
|         u {u}
|     HexDigit: (one of)
|         0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
|     RawInputCharacter:
|         any Unicode character
|
| As a result the lexical analyser honours Unicode escape
| sequences absolutely anywhere in the program text. For
| example, this is a valid Java program:
|     public class Bar {
|         public static void \u006d\u0061\u0069\u006e(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the output:
|     $ javac Bar.java && java Bar
|     hello, world
|
| However, this is an incorrect Java program:
|     public class Baz {
|         // This comment contains \u6d.
|         public static void main(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the error:
|     $ javac Baz.java
|     Baz.java:2: error: illegal unicode escape
|             // This comment contains \u6d.
|                                        ^
|     1 error
|
| Yes, this is an error even if the illegal Unicode escape
| sequence occurs in a comment!
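The translation phase described above is easy to sketch. This is a minimal Python illustration (the function name is my own), including two details from JLS §3.3: an escape only counts when the backslash run before the `u` has odd length, and the `u` marker may be repeated:

```python
def decode_java_unicode(src: str) -> str:
    """Sketch of the JLS 3.3 translation phase: replace \\uXXXX escapes
    everywhere in the source, before any other lexing happens.

    An escape only counts if the backslash run before 'u' has odd
    length, and the 'u' marker may repeat (\\uu0041 is also 'A')."""
    out = []
    i = 0
    while i < len(src):
        if src[i] != '\\':
            out.append(src[i])
            i += 1
            continue
        j = i
        while j < len(src) and src[j] == '\\':
            j += 1
        run = j - i                       # length of the backslash run
        if run % 2 == 1 and j < len(src) and src[j] == 'u':
            out.append('\\' * (run - 1))  # the even prefix stays literal
            k = j
            while k < len(src) and src[k] == 'u':
                k += 1                    # UnicodeMarker: u {u}
            hex4 = src[k:k + 4]
            if len(hex4) == 4 and all(c in '0123456789abcdefABCDEF' for c in hex4):
                out.append(chr(int(hex4, 16)))
                i = k + 4
            else:
                raise SyntaxError("illegal unicode escape")
        else:
            out.append(src[i:j])          # even run: literal backslashes
            i = j
    return ''.join(out)
```

Running it on the examples above turns `class Foo\u007b}` into `class Foo{}`, and it raises the same "illegal unicode escape" error on a truncated escape even inside a comment, since this pass runs before comments exist.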
| ivanjermakov wrote:
| I wonder if full unicode range was accepted because some
| companies are writing code in non-english.
| layer8 wrote:
| Javac uses the platform encoding [0] by default to interpret
| Java source files. This means that Java source code files are
| inherently non-portable. When Java was first developed (and
| for a long time after), this was the default situation for
| any kind of plain text files. The escape sequence syntax
| allows transforming [1] Java source code into a portable
| (that is, ASCII-only) representation that is completely
| equivalent to the original, and also to convert it back to
| any platform encoding.
|
| Source control clients could apply this automatically upon
| checkin/checkout, so that clients with different platform
| encodings can work together. Alternatively, IDEs could do
| this when saving/loading Java source files. That never quite
| caught on, and the general advice was to stick to ASCII, at
| least outside comments.
|
| [0] Since JDK 18, the default encoding defaults to UTF-8.
| This probably also extends to _javac_ , though I haven't
| verified it.
|
| [1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...
| mistercow wrote:
| I also argue that failing to syntax highlight this correctly is
| a security issue. You can terminate block comments with Unicode
| escapes, so if you wanted to hide some malicious code in a Java
| source file, you just need an excuse for there to be a block of
| Unicode escapes in a comment. A dev who doesn't know about this
| quirk is likely to just skip over it, assuming it's commented
| out.
| styglian wrote:
| I once wrote a puzzle using this, which (fortunately) doesn't
| work any more, but would do interesting things on older JDK
| versions: https://pastebin.com/raw/Bh81PwXY
| mcphage wrote:
| At one point there was an open source project to formally specify
| Ruby, but I don't know if it's still alive:
| https://github.com/ruby/spec
|
| Hmm, it seems to be alive, but based more on behavior than
| syntax.
| keybored wrote:
| Meanwhile NeoVim doesn't syntax highlight my commit message
| properly if I have messed with "commit cleanup" enough.
|
| The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and the
| id starts with `#`. One suggestion was to allow backslash escapes
| in commit messages since that makes sense to a computer
| scientist.[1]
|
| But looking at all of this lexical stuff I wonder if
| makes-sense-to-computer-scientist is a good goal. They
| invented the problem
| of using a uniform delimiter for strings and then had to solve
| their own problem. Maybe it was hard to use backtick in the 70's
| and 80's, but today[2] you could use backtick to start a string
| and a single quote to end it.
|
| What do C-like programming languages use single quotes for? To
| quote characters. Why do you need to quote characters? I've never
| seen a literal character which needed an "end character" marker.
|
| Raw strings would still be useful but you wouldn't need raw
| strings just to do a very basic thing like make a string which
| has typewriter quotes in it.
|
| Of course this was for C-like languages. Don't even get me
| started on shell and related languages where basically everything
| is a string and you have to make a single-quote/double-quote
| battle plan before doing anything slightly nested.
|
| [1] https://lore.kernel.org/git/vpq3808p40o.fsf@anie.imag.fr/
|
| [2] Notwithstanding us Europeans that use a dead-key keyboard
| layout where you have to type twice to get one measly backtick
| (not that I use those)
| pwdisswordfishz wrote:
| > The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and
| the id starts with `#`
|
| https://git-scm.com/docs/git-commit#Documentation/git-commit...
| keybored wrote:
| See "commit cleanup".
|
| There's surprising layers to this. That the reporter in that
| thread says that git-commit will "happily" accept `#` in
| commit messages is half-true: it will accept it if you don't
| edit the message since the `default` cleanup (that you linked
| to) will not remove comments if the message is given through
| things like `-m` and not an editing session. So `git commit
| -m'#something'` is fine. But then try to do rebase and cherry-
| pick and whatever else later, and maybe you get a merge commit
| message with a comment listing the "conflicted" files. Well,
| it can get confusing.
| kragen wrote:
| > _Maybe it was hard to use backtick in the 70's and 80's, but
| today[2] you could use backtick to start a string and a single
| quote to end it._
|
| That's how quoting works by default in m4 and TeX, both defined
| in the 70s. Unfortunately Unicode retconned the ASCII
| apostrophe character ' to be a vertical line, maybe out of a
| misguided deference to Microsoft Windows, and now we all have
| to suffer the consequences. (Unless we're using Computer Modern
| fonts or other fonts that predate this error, such as VGA font
| ROM dumps.)
|
| In the 70s and 80s, and into the current millennium on Unix,
| `x' did look like 'x', but now instead it looks like dogshit.
| Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
| Unfortunately, following the Unicode botch theme, the standard
| failed to define an endianness or minimum width for them, so
| they're not very useful today. You can use them as enum values
| if you want to make your memory dumps easier to read in the
| debugger, and that's about it. I think Microsoft's compiler
| botched them so badly that even that's not an option if you
| need your code to run on it.
| ygra wrote:
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line
|
| Unicode does not prescribe the appearance of characters.
| Although the code chart [1] says >>neutral (vertical)
| glyph with mixed usage<< (next to >>apostrophe-quote<< and
| >>single quote<<), font vendors have to deal with this mixed
| usage. And with Unicode the correct quotation marks have
| their own code points, making it unnecessary to design fonts
| where the ASCII apostrophe takes their form, but rendering
| all other uses pretty ugly.
|
| I would regard using ` and ' as paired quotation marks as a
| hack from times when typographic expression was simply not
| possible with the character sets of the day.
|
| _________
|
| [1] 0027 ' APOSTROPHE
|     = apostrophe-quote (1.0)
|     = single quote
|     = APL quote
|     * neutral (vertical) glyph with mixed usage
|     * 2019 ' is preferred for apostrophe
|     * preferred characters in English for paired quotation
|       marks are 2018 ' & 2019 '
|     * 05F3 ' is preferred for geresh when writing Hebrew
|     - 02B9 ' modifier letter prime
|     - 02BC ' modifier letter apostrophe
|     - 02C8 ' modifier letter vertical line
|     - 0301 $ combining acute accent
|     - 030D $ combining vertical line above
|     - 05F3 ' hebrew punctuation geresh
|     - 2018 ' left single quotation mark
|     - 2019 ' right single quotation mark
|     - 2032 ' prime
|     - A78C latin small letter saltillo
| keybored wrote:
| > That's how quoting works by default in m4 and TeX, both
| defined in the 70s.
|
| Good point. And it was in m4[1] I saw that
| backtick+apostrophe syntax. I would have probably not thought
| of that possibility if I hadn't seen it there.
|
| [1] Probably on Wikipedia since I have never used it
|
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line, maybe out of a misguided
| deference to Microsoft Windows, and now we all have to suffer
| the consequences. (Unless we're using Computer Modern fonts
| or other fonts that predate this error, such as VGA font ROM
| dumps.)
|
| I do think the vertical line looks subpar (and I don't use it
| in prose). But most programmers don't seem bothered by it. :|
|
| > In the 70s and 80s, and into the current millennium on
| Unix, `x' did look like 'x', but now instead it looks like
| dogshit.
|
| Emacs tries to render it like 'x' since it uses
| backtick+apostrophe for quotes. With some mixed results in my
| experience.
|
| > Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| Aha, I honestly didn't even think that far. Seems a bit
| restrictive to not be able to use possessives and
| contractions in strings without escapes.
|
| > As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
|
| I should have made it clear that I was only considering
| C-likes and not C itself. A language from the C trigraph days
| can be excused. To a certain extent.
| kragen wrote:
| I'd forgotten about `' in Emacs documentation! That may be
| influenced by TeX.
|
| C multicharacter literals are unrelated to trigraphs.
| Trigraphs were a mistake added many years later in the ANSI
| process.
| tom_ wrote:
| See also: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
| kragen wrote:
| This is an excellent document. I disagree with its
| normative conclusions, because I think being incompatible
| with ASCII, Unix, Emacs, and TeX is worse than being
| incompatible with ISO-8859-1, Microsoft Windows, and MacOS
| 9, but it is an excellent reference for the factual
| background.
| shawa_a_a wrote:
| The comment character is also configurable:
| git config core.commentchar <char>
|
| This is helpful where you want to use, say, Markdown to have
| tidily formatted commit messages make up your pull request
| body too.
| keybored wrote:
| I want to try to set it to `auto` and see what spicy things
| it comes up with.
| samatman wrote:
| There are no problems caused by using unary delimiters for
| strings, because using paired delimiters for strings doesn't
| solve the problems unary delimiters create.
|
| By nature, strings contain arbitrary text. Paired delimiters
| have one virtue over unary: they nest, but this virtue is only
| evident when a syntax requires that they _must_ nest, and this
| is not the case for strings. It's but a small victory to
| reduce the need for some sort of escaping, without eliminating
| it.
|
| Of the bewildering variety of partial solutions to the dilemma,
| none fully satisfactory, I consider the `backtick quote'
| pairing among the worst. Aside from the aesthetic problems,
| which can be fixed with the right choice of font, the bare
| apostrophe is much more common in plain text than an unmatched
| double quote, and the convention does nothing to help.
|
| This comes at the cost of losing a type of string, and backtick
| strings are well-used in many languages, including by you in
| your second paragraph. What we would get in return for this
| loss is, nothing, because `don't' is just as invalid as 'don't'
| and requires much the same solution. `This is `not worth it',
| you see', especially as languages like to treat strings as
| single tokens (many exceptions notwithstanding) and this
| introduces a push-down to that parse for, again, no appreciable
| benefit.
|
| I do agree with you about C and character literals, however.
| The close quote isn't needed and always struck me as somewhat
| wasteful. 'a is cleaner, and reduces the odds of typing "a"
| when you mean 'a'.
| yen223 wrote:
| select'select'select
|
| is a perfectly valid SQL query, at least for Postgres.
|
| Languages' approaches to whitespace between tokens are all
| over the place.
| notsylver wrote:
| As soon as I saw this was part of llamafile I was hoping that it
| would be used to limit LLM output to always be "valid" code as
| soon as it saw the backticks, but I suppose most LLMs don't have
| problems with that anyway. And I'm not sure you'd want something
| like that automatically forcing valid code anyway
| dilap wrote:
| llama.cpp does support something like this -- you can give it a
| grammar which restricts the set of available next tokens that
| are sampled over
|
| so in theory you could notice "```python" or whatever and then
| start restricting to valid python code. (in least in theory,
| not sure how feasible/possible it would be in practice w/ their
| grammar format.)
|
| for code i'm not sure how useful it would be since likely any
| model that is giving you working code wouldn't be struggling w/
| syntax errors anyway?
|
| but i have had success experimentally using the feature to
| drive fiction content for a game from a smaller llm to be in a
| very specific format.
| notsylver wrote:
| yeah, ive used llama.cpp grammars before, which is why i was
| thinking about it. i just think it'd be cool for llamafile to
| do basically that, but with included defaults so you could
| eg, require JSON output. it could be cool for prototyping or
| something. but i dont think that would be too useful anyway,
| most of the time i think you would want to restrict it to a
| specific schema, so i can only see it being useful for
| something like a tiny local LLM for code completion, but that
| would just encourage valid-looking but incorrect code.
|
| i think i just like the idea of restricting LLM output, it
| has a lot of interesting use cases
| dilap wrote:
| gotchya. i do think that is a cool idea actually -- LLMs
| tiny enough to do useful things with formally structured
| output but not big enough to nail the structure ~100% is
| probably not an empty set.
| pwdisswordfishz wrote:
| > Of all the languages, I've saved the best for last, which is
| Ruby. Now here's a language whose syntax evades all attempts at
| understanding.
|
| TeX with its arbitrarily reprogrammable lexer: how adorable
| fanf2 wrote:
| Lisp reader macros allow you to program its lexer too.
| skydhash wrote:
| You can basically define a new language with a few lines of
| code in Racket.
| pansa2 wrote:
| > _TypeScript, Swift, Kotlin, and Scala take string interpolation
| to the furthest extreme of encouraging actual code being embedded
| inside strings. So to highlight a string, one must count curly
| brackets and maintain a stack of parser states._
|
| Presumably this is also true in Python - IIRC the brace-delimited
| fields within f-strings may contain arbitrary expressions.
|
| More generally, this must mean that the lexical grammar of those
| languages isn't regular. "Maintaining a stack" isn't part of a
| finite-state machine for a regular grammar - instead we're in the
| realm of pushdown automata and context-free grammars.
|
| Is it even possible to support generalized string interpolation
| within a strictly regular lexical grammar?
| aphantastic wrote:
| > Is it even possible to support generalized string
| interpolation within a strictly regular lexical grammar?
|
| Almost certainly not, a fun exercise is to attempt to devise a
| Pumping tactic for your proposed language. If it doesn't exist,
| it's not regular.
|
| https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...
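A quick way to see the limitation concretely (illustrative Python, not a real f-string lexer): a regular expression can be written for any *fixed* nesting depth, but each extra level needs a strictly larger pattern, and no single regular pattern covers unbounded nesting — which is what a pumping argument makes precise:

```python
import re

# Matches f-string-shaped literals whose {...} fields contain no
# further braces, i.e. nesting depth at most 1. Any fixed depth can
# be handled this way; unbounded depth cannot.
DEPTH1 = re.compile(r'f"(?:[^"{}]|\{[^{}]*\})*"')

assert DEPTH1.fullmatch('f"a {x} b {y} c"')       # depth 1: accepted
assert not DEPTH1.fullmatch('f"a {f"n {y}"} b"')  # nested: rejected
```

To accept one more level you would have to splice another copy of the brace alternative inside itself, and so on forever — a finite automaton has no stack to count with.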
| fanf2 wrote:
| Complicated interpolation can be lexed as a regular language if
| you treat strings as three separate lexical things, eg in
| JavaScript template literals there are,
| `stuff${ }stuff${ }stuff`
|
| so the ${ and } are extra closing and opening string
| delimiters, leaving the nesting to be handled by the parser.
|
| You need a lexer hack so that the lexer does not treat } as the
| start of a string literal, except when the parser is inside an
| interpolation but all nested {} have been closed.
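The scheme can be sketched in Python as a toy lexer (only identifiers, braces, and template literals are handled here; the TemplateHead/Middle/Tail token names follow the ECMAScript grammar, the rest is simplified):

```python
def lex(src):
    """Toy lexer for a JS-like fragment with template literals.
    depth_stack implements the "lexer hack": a `}` only resumes a
    template when the innermost interpolation's nested {} are closed."""
    tokens = []
    depth_stack = []   # one entry per open interpolation: its {} depth
    i = 0
    while i < len(src):
        c = src[i]
        if c == '`':
            # Scan string text until `${` (head token) or closing backtick.
            j = i + 1
            while src[j] != '`' and src[j:j + 2] != '${':
                j += 1
            if src[j] == '`':
                tokens.append(('NoSubstitutionTemplate', src[i:j + 1]))
                i = j + 1
            else:
                tokens.append(('TemplateHead', src[i:j + 2]))
                depth_stack.append(0)
                i = j + 2
        elif c == '}' and depth_stack and depth_stack[-1] == 0:
            # This `}` re-enters the template: scan for `${` or backtick.
            j = i + 1
            while src[j] != '`' and src[j:j + 2] != '${':
                j += 1
            if src[j] == '`':
                tokens.append(('TemplateTail', src[i:j + 1]))
                depth_stack.pop()
                i = j + 1
            else:
                tokens.append(('TemplateMiddle', src[i:j + 2]))
                i = j + 2
        elif c in '{}':
            # Plain braces inside an interpolation adjust the depth counter.
            if depth_stack:
                depth_stack[-1] += 1 if c == '{' else -1
            tokens.append(('Punct', c))
            i += 1
        elif c.isspace():
            i += 1
        else:
            j = i + 1
            while j < len(src) and not src[j].isspace() and src[j] not in '`{}':
                j += 1
            tokens.append(('Ident', src[i:j]))
            i = j
    return tokens
```

For ``lex('`a${ x }b${ y }c`')`` this yields exactly the three string-piece tokens described above, with the expressions between them lexed normally.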
| irdc wrote:
| I'd be interested to see a re-usable implementation of joe's[0]
| syntax highlighting.[1] The format is powerful enough to allow
| for the proper highlighting of Python f-strings.[2]
|
| 0. https://joe-editor.sf.net/
|
| 1. https://github.com/cmur2/joe-syntax/blob/joe-4.4/misc/HowItW...
|
| 2. https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...
| pama wrote:
| Interestingly, python f-strings changed their syntax at version
| 3.12, so highlighting should depend on the version.
| irdc wrote:
| It's just that nesting them arbitrarily is now allowed,
| right? That shouldn't matter much for a mere syntax
| highlighter then. And one could even argue that code that
| relies on this too much is not really for human consumption.
| pansa2 wrote:
| Also, you can now use the same quote character that
| encloses an f-string within the {} expressions. That could
| make them harder to tokenize, because it makes it harder to
| recognise the end of the string.
| akira2501 wrote:
| I've actually made several lexers and parsers based on the joe
| DFA style of parsing. The state and transition syntax was
| something that I always understood much more easily than the
| standard tools.
|
| The downside is your rulesets tend to get more verbose and are
| a little bit harder to structure than they might ideally be in
| other languages more suited towards the purpose, but I actually
| think that's an advantage, as it's much easier to reason about
| every production rule when looking at the code.
| rererereferred wrote:
| In the C# multiquoted strings, how does it know that this:
|     Console.WriteLine("""""");
|     Console.WriteLine("""""");
|
| are 2 triplequoted empty strings and not one
| "\nConsole.WriteLine(" sixtuplequoted string?
| ygra wrote:
| The former, I'd say.
|
| https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
|
| For a multi-line string the quotes have to be on their own
| line.
| Joker_vD wrote:
| If the opening quotes are followed by anything that is not a
| whitespace before the next new-line (or EOF), then it's a
| single-line string.
|
| I imagine implementing those things took several iterations :)
| yen223 wrote:
| It's a syntax error!
|     Unterminated raw string literal.
|
| https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...
| Joker_vD wrote:
| Ah, so there is no backtracking in lexer for this case. Makes
| sense.
| ygra wrote:
| As for C#'s triple-quoted strings, they actually came from Java
| before and C# ended up adopting the same or almost the same
| semantics. Including stripping leading whitespace.
| pdw wrote:
| Some random things that the author seem to have missed:
|
| > but TypeScript, Swift, Kotlin, and Scala take string
| interpolation to the furthest extreme of encouraging actual code
| being embedded inside strings
|
| Many more languages support that:
|     C#          $"{x} plus {y} equals {x + y}"
|     Python      f"{x} plus {y} equals {x + y}"
|     JavaScript  `${x} plus ${y} equals ${x + y}`
|     Ruby        "#{x} plus #{y} equals #{x + y}"
|     Shell       "$x plus $y equals $(echo "$x+$y" | bc)"
|     Make :)     echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
|
| > Tcl
|
| Tcl is funny because comments are only recognized in code, and
| since it's homoiconic, it's very hard to distinguish code and
| data. { } are just funny string delimiters. E.g.:
|     xyzzy {#hello world}
|
| Is xyzzy a command that takes a code block or a string? There's
| no way to tell. (Yes, that means that the Tcl tokenizer/parser
| cannot discard comments: only at evaluation time it's possible to
| tell if something is a comment or not.)
|
| > SQL
|
| PostgreSQL has the very convenient dollar-quoted strings:
| https://www.postgresql.org/docs/current/sql-syntax-lexical.h...
| E.g. these are equivalent:
|     'Dianne''s horse'
|     $$Dianne's horse$$
|     $SomeTag$Dianne's horse$SomeTag$
| autarch wrote:
| Perl lets you do this too:
|     my $foo = 5;
|     my $bar = 'x';
|     my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
|     print "$quux\n";
|
| This prints out:
|     I have 5 x's: xxxxx
|
| The "@{[...]}" syntax is abusing Perl's ability to interpolate
| an _array_ as well as a scalar. The inner "[...]" creates an
| array reference and the outer "@{...}" dereferences it.
|
| For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates the
| array reference.
| Izkata wrote:
| > For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates
| the array reference.
|
| ...because it's an array value? Aside from how the languages
| handle references, how is that part any different from, for
| example, this in python:
|     >>> [5 * 'x']
|     ['xxxxx']
|
| You can put (almost) anything there, as long as it's an
| expression that evaluates to a value. The resulting value is
| what goes into the array.
| autarch wrote:
| I understand that's constructing an array. What's a bit odd
| is that the interpreter allows you to string interpolate
| any expression when constructing the array reference inside
| the string.
| Izkata wrote:
| It's not...? Well, not directly: It's string
| interpolating an array of values, and the array is
| constructed using values from the results of expressions.
| These are separate features that compose nicely.
| JadeNB wrote:
| > What's a bit odd is that the interpreter allows you to
| string interpolate any expression when constructing the
| array reference inside the string.
|
| Why? Surely it is easier for both the language and the
| programmer to have a rule for what you can do when
| constructing references to anonymous arrays, without
| having to special case whether that anonymous array is or
| is not in a string (or in any one of the many other
| contexts in which such a construct may appear in Perl).
| weinzierl wrote:
| You also don't need quotes around strings (barewords). So
| my $bar = x;
|
| should give the same result.
|
| Good luck with lexing that properly.
|
| https://perlmaven.com/barewords-in-perl
| shawn_w wrote:
| If you're writing anything approaching decent perl that
| won't be accepted.
| emmelaich wrote:
| "use strict" will prevent it and I think strict will be
| assumed/default soon.
| JadeNB wrote:
| As of Perl 5.12, `use`ing a version (necessary to ensure
| availability of some of the newer features) automatically
| implies `use strict`.
|
| https://perldoc.perl.org/strict#HISTORY
| weinzierl wrote:
| Doesn't really matter for a syntax highlighter, because
| it is out of your control what you get. For the llamafile
| highlighter even more so since it supports other legacy
| quirks, like C trigraphs as well.
| layer8 wrote:
| > actual code being embedded inside strings
|
| My view on this is that it shouldn't be interpreted as code
| being embedded inside strings, but as a special form of string
| concatenation syntax. In turn, this would mean that you can
| nest the syntax, for example:
|     "foo { toUpper("bar { x + y } bar") } foo"
|
| The individual tokens being (one per line):
|     "foo {
|     toUpper
|     (
|     "bar {
|     x
|     +
|     y
|     } bar"
|     )
|     } foo"
|
| If `+` does string concatenation, the above would effectively
| be equivalent to:
|     "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
|
| I don't know if there is a language that actually works that
| way.
| panzi wrote:
| Indeed in some of the listed languages you can nest it like
| that, but in others (e.g. Python) you can't. I would guess
| they deliberately don't want to enable that and it's not a
| problem in their parser or something.
| layer8 wrote:
| Even when nesting is disallowed, my point is that I find it
| preferable to not view it (and syntax-highlight it) as a
| "special string" with embedded magic, but as multiple
| string literals with just different delimiters that allow
| omitting the explicit concatenation operator, and normal
| expressions interspersed in between. I think it's important
| to realize that it is really just very simple syntactic
| sugar for normal string concatenation.
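Python's own compiler takes roughly this view: an f-string parses to a JoinedStr node, i.e. a sequence of plain string constants alternating with interpolated expressions — essentially implicit concatenation. A quick check with the stdlib ast module:

```python
import ast

# An f-string becomes JoinedStr: Constant pieces alternating with
# FormattedValue expressions, not a "string with magic inside".
node = ast.parse('f"foo {x + y} bar"', mode="eval").body
assert isinstance(node, ast.JoinedStr)
print([type(part).__name__ for part in node.values])
# ['Constant', 'FormattedValue', 'Constant']
```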
| Timwi wrote:
| While you're conceptually right, in practice I think it
| bears mentioning that in C# the two syntaxes compile
| differently. This is because C#'s target platform, the
| .NET Framework, has always had a function called
| `string.Format` that lets you write this:
|     var str = string.Format("{0} is {1} years old.", name, age);
|
| When interpolated strings were introduced later, it was
| natural to have them compile to this instead of
| concatenation.
| layer8 wrote:
| There's no reason in principle why
|     name + " is " + age + " years old."
|
| couldn't compile to exactly the same. (Other than maybe
| `string.Format` having some additional customizable
| behavior, I don't know C# that well.)
| epcoa wrote:
| Like Python, and Rust with the format! macro (which
| doesn't even support arbitrary expressions), in C# the full
| syntax for interpolated/formatted strings is this:
|     {<interpolationExpression>[,<alignment>][:<formatString>]}
|
| i.e. there is more going on than just a simple wrapper around
| concat or StringBuilder.
| ygra wrote:
| When not using the format specifiers or alignment it will
| indeed compile to just string.Concat (which is also what
| the + operator for strings compiles to). Similar to C
| compilers choosing to call puts instead of printf if
| there is nothing to be formatted.
| epcoa wrote:
| If it's treated strictly as simple concatenation
| syntactic sugar then you are allowing something like
|     print("foo { func() );
|
| which seems janky af.
|
| > just very simple syntactic sugar for normal string
| concatenation.
|
| Maybe. There's also possibly a string conversion. It
| seems reasonable to want to disallow implicit string
| conversion in a concatenation operator context
| (especially if overloading +) while allowing it in the
| interpolation case.
| layer8 wrote:
| I failed to mention the balancing requirement, that
| should of course remain. But it's an artificial
| requirement, so to speak, that is merely there to double-
| check the programmer's intent. The compiler/parser
| wouldn't actually care (unlike for an arithmetic
| expression with unbalanced parentheses, or scope blocks
| with unbalanced braces), the condition is only checked
| for the programmer's benefit.
|
| > There's also possibly a string conversion. It seems
| reasonable to want to disallow implicit string conversion
| in a concatenation operator context (especially if
| overloading +) while allowing it in the interpolation
| case.
|
| Many languages have a string concatenation operator that
| does implicit conversion to string, while still having a
| string interpolation syntax like the above. It's kind of
| my point that both are much more similar to each other
| than many people seem to realize.
| Tarean wrote:
| As of python 3.6 you can nest fstrings. Not all formatters
| and highlighters have caught up, though.
|
| Which is fun, because correct highlighting depends on
| language version. Haskell has similar problems where
| different compiler flags require different parsers. Close
| enough is sufficient for syntax highlighting, though.
|
| Python is also a bit weird because it calls the format
| methods, so objects can intercept and react to the format
| specifiers in the f-string while being formatted.
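For instance, a class can implement `__format__` and receive whatever follows the `:` in the replacement field verbatim (an illustrative sketch; the `Temperature` class and its "F" spec are made up):

```python
class Temperature:
    """Object that inspects the format spec an f-string passes it."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Whatever follows ':' in the replacement field arrives here verbatim.
        if spec == "F":
            return f"{self.celsius * 9 / 5 + 32:.1f}F"
        return f"{self.celsius:.1f}C"

t = Temperature(20)
print(f"{t}")    # no spec -> __format__(t, "") -> 20.0C
print(f"{t:F}")  # spec "F" -> Fahrenheit rendering -> 68.0F
```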
| panzi wrote:
| I didn't mean nested f-strings. I mean this is a syntax
| error: >>> print(f"foo {"bar"}")
| SyntaxError: f-string: expecting '}'
|
| Only this works: >>> print(f"foo
| {'bar'}") foo bar
| pdw wrote:
| You're using an old Python version. On recent versions,
| it's perfectly fine: Python 3.12.7
| (main, Oct 3 2024, 15:15:22) [GCC 14.2.0] on linux
| Type "help", "copyright", "credits" or "license" for more
| information. >>> print(f"foo {"bar"}")
| foo bar
| epcoa wrote:
| > "foo { ...
|
| That should probably not be one token.
|
| > My view on this is that it shouldn't be interpreted as code
| being embedded inside strings
|
| I'm not sure exactly what you're proposing and how it is
| different. You still can't parse it as a regular lexical
| grammar.
|
| How does this change how you highlight either?
|
| Whatever you call it, to the lexer it is a special string, it
| has to know how to match it, the delimiters are materially
| different than concatenation.
|
| I might be being dense but I'm not sure what's formally
| distinct.
| layer8 wrote:
| > > "foo { ...
|
| > That should probably not be one token.
|
| It's exactly the point that this is one token. It's a
| string literal with opening delimiter `"` and closing
| delimiter `{`, and that whole token itself serves as a kind
| of opening "brace". Alternatively, you can see `{` as a
| contraction of `" +`. Meaning, aside from the brace
| balancing requirement, `"foo {` does the same as `"foo " +`
| would.
|
| Still alternatively, you could imagine a language that
| concatenates around string literals by default, similar to
| how C behaves for sequences of string literals. In C,
| "foo" "bar" "baz"
|
| is equivalent to "foobarbaz"
|
| Similarly, you could imagine a language where
| "foo" some_variable "bar"
|
| would perform implicit concatenation, without needing an
| explicit operator (as in `"foo" + x + "bar"`). And then
| people might write it without the inner whitespace, as:
| "foo"some_variable"bar"
|
| My point is that "foo{some_variable}bar"
|
| is really just that (plus a condition requiring balanced
| pairs of braces). You can also re-insert the spaces for
| emphasis: "foo{ some_variable }bar"
|
| The fact that people tend to think of `{some_variable}` as
| an entity is sort-of an illusion.
|
| > How does this change how you highlight either?
|
| You would highlight the `"...{`, `}...{`, and `}..."` parts
| like normal string literals (they just use curly braces
| instead of double quotes at one or both ends), and
| highlight the inner expressions the same as if they weren't
| surrounded by such literals.
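One way to make this view concrete is a toy lexer (for a hypothetical language, not any real one) that emits `"...{`, `}...{`, and `}..."` chunks as single string-part tokens, with the expressions between them handed off separately:

```python
def lex_interpolated(src):
    """Split an interpolated string literal like "foo{x}bar" into tokens.

    Each string part (including its '"'/'{'/'}' delimiters) becomes one
    token; expression parts are returned separately.  A sketch only: no
    nesting, no escapes, no error recovery.
    """
    assert src[0] == '"' and src[-1] == '"'
    tokens = []
    i = 0
    while i < len(src) - 1:
        # A string-part token runs from the current opener to the next '{' or '"'.
        j = i + 1
        while src[j] not in '{"':
            j += 1
        tokens.append(("STR_PART", src[i:j + 1]))
        if src[j] == '"':
            break
        # Expression part: runs to the (flat) matching '}'.
        k = src.index("}", j + 1)
        tokens.append(("EXPR", src[j + 1:k]))
        i = k  # the '}' becomes the opener of the next string part
    return tokens

print(lex_interpolated('"foo{some_variable}bar"'))
```

Balancing the braces is then a separate well-formedness check over the token stream, just as described above.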
| epcoa wrote:
| > It's exactly the point that this is one token.
|
| Fair enough. The point, as you have acknowledged, being
| that unlike + you have to treat { specially for balancing
| (and separately from the ").
|
| > The fact that people tend to think of `{some_variable}`
| as an entity is sort-of an illusion.
|
| I guess. I just don't know what being an illusion means
| formally. It's not an illusion to the person that has to
| implement the state machine that balances the delimiters.
|
| > You would highlight the `"...{`, `}...{`, and `}..."`
| parts like normal string literals (they just use curly
| braces instead of double quotes at one or both ends), and
| highlight the inner expressions the same as if they
| weren't surrounded by such literals
|
| Emacs does it this way FWIW. But I'm not sure how
| important it is to dictate that the brace can't be a
| different color.
|
| In any event, I can agree your design is valid (Kotlin
| works this way), but I don't necessarily agree it is any
| more valid than, say, how Python does it, where there can
| be format specifiers and implicit conversion to string is
| performed, whereas it is not with concatenation. I'm not
| seeing the clear definitive advantage of interpolated strings
| being an equivalent to concatenation vs some other type
| of method call.
|
| The other detail is order of evaluation or sequencing.
| String concat may behave differently. Not sure I agree it
| is wrong, because at the end of the day it is distinct
| looking syntax. Illusion or not, it looks like a neatly
| enclosed expression, and concatenation looks like
| something else. That they might parse, evaluate, or behave
| differently isn't unreasonable.
| panzi wrote:
| Is this a bash-ism? "$x plus $y equals
| $((x+y))"
| jonahx wrote:
| This works in "sh" as well for me.
| panzi wrote:
| On some systems (like on mine) sh is just a link to bash,
| so I couldn't test it.
| Izkata wrote:
| Isn't bash supposed to act like sh when executed with
| that name?
| saagarjha wrote:
| It still has bashisms
| jwilk wrote:
| No, it's portable shell syntax.
| LukeShu wrote:
| "$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic
| Expansion").
|
| But if I'm not mistaken, it originated in ksh.
| susam wrote:
| > Is this a bash-ism?
|
| > "$x plus $y equals $((x+y))"
|
| No, it is specified in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...
| therein wrote:
| > PostgreSQL has the very convenient dollar-quoted strings
|
| I did not know that. Today I learned.
| sundarurfriend wrote:
| > Many more languages support that:
|
| Julia as well:         "$x plus $y equals $(x+y)"
| thesz wrote:
| VHDL
|
| There is a record constructor syntax in VHDL using attribute
| invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr).
| This means that if your record has a first field a subtype of a
| character type, you can get record construction expression like
| this one: REC'('0',1,"10101").
|
| Good luck distinguishing, at the lexical level, between '('
| as a character literal and the token sequence "'", "(",
| "'0'".
|
| Haskell.
|
| Haskell has context-free syntax for bracketed ("{-" ... "-}")
| comments. The lexer has to keep bracketed comment syntax
| balanced (for every "{-" there should be an accompanying
| "-}" somewhere).
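Balanced comments mean the lexer needs a counter rather than a plain regular expression. A sketch of the depth-tracking scan (in Python, using Haskell's delimiters):

```python
def skip_nested_comment(src, i):
    """Given that src[i:] starts with '{-', return the index just past
    the matching '-}', honouring nesting.  Raises if unterminated."""
    assert src[i:i + 2] == "{-"
    depth = 0
    while i < len(src):
        if src[i:i + 2] == "{-":
            depth += 1      # another open bracket: go one level deeper
            i += 2
        elif src[i:i + 2] == "-}":
            depth -= 1      # close bracket: come back up one level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated block comment")

src = "{- outer {- inner -} still outer -}x = 1"
end = skip_nested_comment(src, 0)
print(src[end:])  # -> x = 1
```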
| 1vuio0pswjnm7 wrote:
| Shell "$x plus $y equals $((x+y))"
|
| Shell "$x plus $y equals $(expr $x + $y)"
| Izkata wrote:
| Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y"
| | bc)"
|
| I'm guessing this is the reason for the :) but to be clear for
| anyone else: Make is only doing half of the work, whatever
| comes after "shell" is being passed to another executable, then
| make captures its stdout and interpolates that. The other
| executable is "sh" by default but can be changed to whatever.
| mbo wrote:
| > Scala
|
| A note about Scala's string interpolation: interpolated
| strings can be used as pattern match targets.
|         val s"${a} + ${b}" = "1 + 2"
|         println(a) // 1
|         println(b) // 2
| vidarh wrote:
| Ruby takes this to 100. As much as I love Ruby, this is
| valid Ruby, and I can't defend this:
|         puts "This is #{<<HERE.strip} evil"
|         incredibly
|         HERE
|
| Just to combine the string interpolation with her concern over
| Ruby heredocs.
|
| My other favorite evil quirk in Ruby is that whitespace is a
| valid quote character. The string "% hello " (quotes not
| included) is a quoted string containing "hello", as "%" in
| contexts where there is no left operand initiates a quoted
| string and the next character indicates the type of quotes.
| This is great when you do e.g. "%(this is a string)" or
| "%{this is a string}". It's not so great if you use space
| (I've _never_ seen that in the wild, so it'd be nice if it
| was just removed - even irb doesn't handle it correctly).
| jart wrote:
| https://pbs.twimg.com/media/GbEfj6fbQAQRUB7?format=png&name=...
|
| That's so going in the blog post later today.
| vidarh wrote:
| Heh. I love Ruby, but, yes, the parser is "interesting",
| for values of interesting left undefined for its high
| obscenity content.
| mdaniel wrote:
| And don't overlook the fact that the bare-word heredoc, or
| its "HERE" friend, is still in an interpolation context, so...
|         puts "hello #{<<onoz.strip} world"
|         recursion is #{<<onoz.strip}
|         recursive
|         onoz
|         onoz
|         puts "that was fun"
|
| yields
|         hello recursion is recursive world
|         that was fun
|
| and then there's its backtick friend
|         puts "hello #{<<`onoz`.strip} world"
|         date -u
|         onoz
|
| coughs up
|         hello Sun Nov 3 17:25:32 UTC 2024 world
|
| and for those trying out your percent-space trick, be aware
| that it only tolerates such a thing in a standalone
| expression context, so
|         puts (% hello )+" world"
|         # or
|         x = % hello 
|         puts x
|
| because when I tried it "normally" I got
|         $ /usr/bin/ruby -e 'puts % hello + "world"'
|         -e:1:in `<main>': undefined local variable or method
|         `hello' for main:Object (NameError)
|         $ /usr/bin/ruby -v
|         ruby 2.6.10p210 (2022-04-12 revision 67958)
|         [universal.x86_64-darwin21]
|
| but, at the intersection is "ruby parsing is the 15th circle
| of hell"
|         ruby -e 'puts (% #{<<FOO.strip} )+ " world"
|         hello
|         FOO
|         '
| cryptonector wrote:
| jq: "\("hello" + "world")!!"
|
| I wish PG had dollar-bracket quoting where you have to use the
| closing bracket to close, that way vim showmatch would work
| trivially. Something like ${...}$.
| bastawhiz wrote:
| Python f-strings are kind of wild. They can even contain
| comments! They also have slightly different rules for parsing
| certain kinds of expressions, like := and lambdas. And until
| fairly recently, strings inside the expressions couldn't use
| the quote type of the f-string itself (or backslashes).
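A few of those parsing quirks, runnable on any recent Python (the comment-in-f-string and matching-quote forms need 3.12's PEP 701, so they are only mentioned in comments here):

```python
# The walrus operator must be parenthesized in a replacement field,
# since a trailing '=' would otherwise be read as the debug specifier.
print(f"{(x := 10)}")

# Lambdas need parentheses too: a bare ':' would start a format spec.
print(f"{(lambda v: v * 2)(x)}")

# Before Python 3.12 (PEP 701), the inner quotes had to differ from
# the outer ones, and backslashes were banned inside the braces.
d = {"key": "value"}
print(f"{d['key']}")
```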
| __MatrixMan__ wrote:
| This was a fun read, but it left me a bit more sympathetic to the
| lisp perspective, which (if I've understood it) is that syntax,
| being not an especially important part of a language, is more of
| a hurdle than a help, and should be as simple and uniform as
| possible so we can focus on other things.
|
| Which is sort of ironic because learning how to do structural
| editing on lisps has absolutely been more hurdle than help so
| far, but I'm sure it'll pay off eventually.
| mqus wrote:
| Having a simple syntax might be fine for computers but syntax
| is mainly designed to be read and written by humans. Having a
| simple one like lisp then just turns syntactic discussions
| into semantic ones, shifting the problem between layers.
|
| And I think a complex syntax is far easier to read and write
| than a simple syntax with complex semantics. You also get a
| faster feedback loop in case the syntax of your code is wrong
| vs the semantics (which might be undiscovered until runtime).
| drewr wrote:
| I don't understand your distinction between syntax and
| semantics. If the semantics are complex, wouldn't that mean
| the syntax is thus complex?
| SuperCuber wrote:
| lisp's syntax is simple - it's just parentheses to define a
| list, and the first element of a list is executed as a
| function.
|
| but for example a language like C has many different
| syntaxes for different operations, like function
| declaration or variable or array syntax, or if/switch-case
| etc etc.
|
| so to know C syntax you need to learn all these different
| ways to do different things, but in lisp you just need to
| know how to match parentheses.
|
| But of course you still want to declare variables, or have
| if/else and switch case. So you instead need to learn the
| builtin macros (what GP means by semantics) and their
| "syntax" that is technically not part of the language's
| syntax but actually is since you still need all those
| operations enough that they are included in the standard
| library and defining your own is frowned upon.
| kryptiskt wrote:
| Lisp has way more syntax than that; parenthesized lists
| don't cover any of the special forms. Knowing about
| application syntax doesn't
| help with understanding `let` syntax. Even worse, with
| macros, the amount of syntax is open-ended. That they all
| come in the form of S-expressions doesn't help a lot in
| learning them.
| skydhash wrote:
| Most languages' abstract machines expose a very simple API,
| it's up to the language to add useful constructs to help us
| write code more efficiently. Languages like Lisp start with
| a very simple syntax, then add those constructs with the
| language itself (even though those can be fixed using a
| standard), others just add it through the syntax. These
| constructs plus the abstract machine's operations form the
| semantics, syntax is however the language designer decided
| to present them.
| __MatrixMan__ wrote:
| Jury's out re: whether I feel this in my gut. Need more time
| with the lisps for that. But re: cognitive load maybe it goes
| like:
|
| 1. 1 language to rule them all, fancy syntax
|
| 2. Many languages, 1 simple syntax to rule them all
|
| 3. Many languages and many fancy syntaxes
|
| Here in the wreckage of the tower of babel, 1. isn't really
| on the table. But 2. might have benefits because the
| inhumanity of the syntax need only be confronted once. The
| cumulative cost of all the competing opinionated fancy
| syntaxes may be the worst option. Think of all the hours lost
| to tabs vs spaces or braces vs whitespace.
| dartos wrote:
| I think 3 is not only a natural state, but the best state.
|
| I don't think we can have 1 language that satisfies the
| needs of all people who write code, and thus, we can't have
| 1 syntax that does that either.
|
| 3 seems the only sensible solution to me, and we have it.
| __MatrixMan__ wrote:
| I dunno, here in 3 the hardest part of learning a
| language has little to do with the language itself and
| more to do with the ecosystem of tooling around that
| language. I think we could more easily get on to the
| business of using the right language for the job if more
| of that tooling were shared. If each language, for
| instance, did not have its own package manager, its own
| IDE, its own linters and language servers all with their
| own idiosyncrasies arising not from deep philosophical
| differences of the associated language but instead from
| accidental quirks of perspective from whoever decided
| that their favorite language needed a new widget.
|
| I admire the widget makers, especially those wrangling
| the gaps between languages. I just wish their work could
| be made easier.
| skydhash wrote:
| I really like the Linux package managers. If you're going
| to write an application that will run on some system,
| it's better to bake dependencies into it. And with
| virtualization and containerization, the system is not
| tied to a physical machine. I've been using containers
| (incus) more and more for real development purposes as I
| can use almost the same environment to deploy. I don't
| care much about the IDE, but I'm glad we have LSP, Tree-
| sitter, and DAP. The one thing I do not like is the
| proliferation of tooling version managers (NVM, ...) instead
| of managing the environment itself (tied to the project).
| andai wrote:
| This is interesting. My first thought was that a language
| where more meaning is expressed in syntax could catch more
| errors at compile time. But there seems to be no reason why
| meaning encoded in semantics could not also be caught at
| compile time.
|
| The main benefit of putting things in the syntax seems to be
| that many errors would become visually obvious.
| broken-kebab wrote:
| The problem with this statement is that it assumes parsing-
| easiness as something universal, and stable. And this is
| certainly not true. You may believe syntax A is so much
| easier simply because it's the syntax you have been dealing
| with most of your career thus your brain is trained for it.
| On top of it a particular task can make a lot of difference:
| most people would agree that regex is simplification versus
| writing the same logic in usual if-then way for pattern
| matching in strings, but I'm not sure many would like to have
| their whole programs looking that way (but even that could be
| subjective, see APL).
| James_K wrote:
| I've always thought these complaints are really just a
| reflection of how stuck we are in the C paradigm. The idea
| that you have to edit programs as text is outdated IMO. It
| should be that your editor operates on the syntax tree of the
| source code. Once you do that, the code can be displayed in
| any way.
| mdaniel wrote:
| I also believe this, and we're actually about half way
| there via MPS <https://github.com/JetBrains/MPS#readme> but
| I'm _pretty sure_ that dream is dead until this LLM hype
| blows over, since LLMs are not going to copy-paste syntax
| trees until the other dream of a universal representation
| materializes[1]
|
| 1: There have been _several_ attempts at Universal ASTs,
| including (unsurprisingly) a JVM-centric one from JetBrains
| https://github.com/JetBrains/intellij-community/blob/idea/24...
| nlitened wrote:
| I am surprised to hear that structural editing has been a
| hurdle for you, and I think I can offer a piece of advice. I
| also used to be terrified by its apparent complexity, but later
| found out that one just needs to use parinfer and to know key
| bindings for only three commands: slurp, barf, and raise.
|
| With just these four things you will be 95% there, enjoying the
| fruits of paredit without any complexity -- all the remaining
| tricks you can learn later when you feel like you're fluent.
| __MatrixMan__ wrote:
| Thanks very much for the advice, it's timely.
|
| <rant> It's not so much the editing itself but the
| unfamiliarity of the ecosystem. It seems it's a square peg
| and I've been crafting a round hole of habits for it:
|
| I guess I should use emacs? How to even configure it such
| that these actions are available? Or maybe I should write a
| plugin for helix so that I can be in a familiar environment.
| Oh, but the helix plugin language is a scheme, so I guess
| I'll use emacs until I can learn scheme better and then write
| that plugin. Oh but emacs keybinds are conflicting with what
| I've configured for zellij, maybe I can avoid conflicts by
| using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex
| seems like it aligns with my modal brain, oh but there goes
| another afternoon of fussing with emacs. Found and reported a
| symex "bug" but apparently it only appears in nix-governed
| environments so I guess I gotta figure out how to report the
| packaging bug (still todo). Also, I guess I might as well
| figure out how to get emacs to evaluate expressions based on
| which ones are selected, since that's one of the fun things
| you can do in lisps, but there's no plugin for the scheme
| that helix is using for its plugin language (which is why I'm
| learning scheme in the first place), but it turns out that AI
| is weirdly good at configuring emacs so now my emacs config
| contains most of what that plugin would entail. Ok, now I'm
| finally ready to learn scheme, I've got this big list of new
| actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy...
| Slurp, barf, and raise you say?
| excellent, I'll focus on those.
|
| I'm not actually trying to critique the unfamiliar space.
| These are all self inflicted wounds: me being persnickety
| about having it my way. It's just usually not so difficult to
| use something new and also have it my way.</rant>
| xenophonf wrote:
| I never bothered with structural editing on Emacs. I just
| use the sentence/paragraph movement commands. M-a, M-e,
| M-n, M-p, M-T, M-space, etc.
| nlitened wrote:
| To be fair, I am not a "lisper" and I don't know Emacs at
| all. I am just a Clojure enjoyer who uses IntelliJ +
| Cursive with its built-in parinfer/paredit.
| pxc wrote:
| > Oh but emacs keybinds are conflicting with what I've
| configured for zellij,
|
| Don't do that. ;)
|
| Emacs is a graphical application! Don't use it in the
| terminal unless you really have to (i.e., you're using it
| on a remote machine and TRAMP will not do).
|
| > it turns out that AI is weirdly good at configuring emacs
|
| I was just chatting with a friend about this. ChatGPT seems
| to be much better at writing ELisp than many other
| languages I've asked it to work with.
|
| Also while you're playing with it, you might be interested
| in checking out kakoune.el or meow, which provide modal
| editing in Emacs but with the selection-first ordering for
| commands, like in Kakoune and Helix rather than the old vi
| way.
|
| PS: symex looks really interesting! Hadn't seen that one
| before.
| cenamus wrote:
| Well, elisp probably accounts for like 85% of the lisp
| code on GH and co, so that'd make sense
| fanf2 wrote:
| Lisp has reader macros which allow you to reprogram its lexer.
| Lisp macros allow you to program the translation from the
| visible structure to the parse tree.
|
| For example, https://pyret.org/
|
| It really isn't simple or necessarily uniform.
| __MatrixMan__ wrote:
| I've heard that certain lisps (Common Lisp comes up when I
| search for reader macros) allow for all kinds of tinkering
| with themselves. But the ability of one to make itself not a
| lisp anymore, while interesting, doesn't seem to say much
| about the merits of sticking to s-expressions, except maybe
| to point out that somebody once decided not to.
| lispm wrote:
| Reader macros are there to program and configure the
| _reader_. The _reader_ is responsible for reading
| s-expressions into internal data structures. There are
| basically two main uses of reader-macros: data structures
| and reader control.
|
| A CL implementation will implement reading lists, symbols,
| numbers, arrays, strings, structures, characters,
| pathnames, ... via reader macros. Additionally the reader
| implements various forms of control operations: conditional
| reading, reading and evaluation, circular datastructures,
| quoting and comments.
|
| This is user programmable&configurable. Most uses will be
| in the two above categories: data structure syntax and
| control. For example we could add a syntax for hash tables
| to s-expressions. An example for a control extension would
| be to add support for named readtables. For example a
| Common Lisp implementation could add a readtable for
| reading s-expressions from Scheme, which has a slightly
| different syntax.
|
| Reader macros were optimized for implementing
| s-expressions, thus the mechanism isn't that convenient as
| a lexer/parser for actual programming languages. It's a
| bit painful to do so, but possible.
|
| A typical reader macro usage, beyond the usage described
| above, is one which implements a different token or
| expression syntax. For example there are reader macros
| which parse infix expressions. This might be useful in Lisp
| code where arithmetic expressions can be written in a more
| conventional infix syntax. The infix reader macro would
| convert infix expressions into prefix data.
| lispm wrote:
| Is Pyret based on _reader macros_? I would think it's much
| easier to use a syntax parser for that.
| kazinator wrote:
| I don't think it's easy to write a good syntax coloring engine
| like the one in Vim.
|
| Syntax coloring has to handle context: different rules for
| material nested in certain ways.
|
| Vim's syntax highlighter lets you declare two kinds of items:
| matches and regions. Matches are simpler lexical rules, whereas
| regions have separate expressions for matching the start and end
| and middle. There are ways to exclude leading and trailing
| material from a region.
|
| Matches and regions can declare that they are contained. In that
| case they are not active unless they occur in a containing
| region.
|
| Contained matches declare which regions contain them.
|
| Regions declare which other regions they contain.
|
| That's the basic semantic architecture; there are bells and
| whistles in the system due to situations that arise.
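The containment idea can be sketched in a few lines (a toy scanner, not Vim's actual engine): a `TODO` match is declared as contained, so it only fires while the comment region is open:

```python
def highlight(src):
    """Toy region/containment scanner: '/* .. */' is a region and
    'TODO' is a contained match, active only inside that region."""
    spans = []          # (group, start, end) like Vim's highlight items
    in_comment = False  # a one-level stand-in for the region stack
    start = 0
    i = 0
    while i < len(src):
        if not in_comment:
            if src.startswith("/*", i):    # region start pattern
                in_comment, start = True, i
                i += 2
            else:
                i += 1
        elif src.startswith("*/", i):      # region end pattern
            spans.append(("comment", start, i + 2))
            in_comment = False
            i += 2
        elif src.startswith("TODO", i):    # contained match
            spans.append(("todo", i, i + 4))
            i += 4
        else:
            i += 1
    return spans

# The TODO outside the comment region is deliberately not highlighted.
print(highlight("x = 1 /* TODO fix */ TODO"))
```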
|
| I don't think even Justine could develop that in an interview,
| other than as an overnight take home.
| kazinator wrote:
| Here is an example of something hard to handle: TXR language
| with embedded TXR Lisp.
|
| This is the "genman" script which takes the raw output of a
| manpage to HTML converter, and massages it to form the HTML
| version of the TXR manual:
|
| https://www.kylheku.com/cgit/txr/tree/genman.txr
|
| Everything that is white (not colored) is literal template
| material. Lisp code is embedded in directives, like @(do ...).
| In this scheme, TXR keywords appear purple, TXR Lisp ones
| green. They can be the same; see the (and ...) in line 149,
| versus numerous occurrences of @(and).
|
| Quasistrings contain nested syntax: see 130 where `<a href ..>
| ... </a>` contains an embedded (if ...). That could itself
| contain a quasistring with more embedded code.
|
| TXR's txr.vim and tl.vim syntax definition files are both
| generated by this:
|
| https://www.kylheku.com/cgit/txr/tree/genvim.txr
| saghm wrote:
| Naively, I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| iterate over the tokens and update the color of a token based
| on the type of node (and maybe just tracking a diff to avoid
| needing to recolor things that haven't changed). I'm guessing
| that if this isn't done, it's for efficiency reasons (e.g. due
| to requiring parsing the whole file to highlight rather than
| just the part currently visible on the screen)?
| Someone wrote:
| > I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| [...] I'm guessing that if this isn't done, it's for
| efficiency reasons
|
| It's not only running time, but also ease of implementation.
|
| A good syntax highlighter should do a decent job highlighting
| both valid and invalid programs (rationale: in most (editor,
| language) pairs, writing a program involves going through
| moments where the program being written isn't a valid
| program)
|
| If you decide to use an AST, that means you need to have good
| heuristics for turning invalid programs into valid ones that
| best mimic what the programmer intended. That can be
| difficult to achieve (good compilers have such heuristics,
| but even if you have such a compiler, chances are it isn't
| possible to reuse them for syntax coloring)
|
| If this simpler approach gives you most of what you can get
| with the AST approach, why bother writing that?
|
| Also, there are languages where some programs can't be
| perfectly parsed or syntax colored without running them. For
| those, you need this approach.
| tomcam wrote:
| > I don't think even Justine could develop that in an interview
|
| Not so sure I'd put money on that opinion ;)
| susam wrote:
| > Every C programmers (sic) knows you can't embed a multi-line
| comment in a multi-line comment.
|
| And every Standard ML programmer might find this to be a
| surprising limitation. The following is a valid Standard ML
| program:
|         (* (* Nested (**) *) comment *)
|         val _ = print "hello, world\n"
|
| Here is the output:
|         $ sml < hello.sml
|         Standard ML of New Jersey (64-bit) v110.99.5
|         [built: Thu Mar 14 17:56:03 2024]
|         - = hello, world
|         $ mlton hello.sml && ./hello
|         hello, world
|
| Given how C was considered one of the "expressive" languages when
| it arrived, it's curious that nested comments were never part of
| the language.
| dahart wrote:
| There are 3 things I find funny about that comment: ML didn't
| have single-line comments, so same level of surprising
| limitation. I've never heard someone refer to C as
| "expressive", but maybe it was in 1972 when compared to
| assembly. And what bearing does the comment syntax have on the
| expressiveness of a language? I would argue absolutely none at
| all, by _definition_. :P
| susam wrote:
| > ML didn't have single-line comments, so same level of
| surprising limitation.
|
| It is not quite clear to me why the lack of single-line
| comments is such a surprising limitation. After all, a
| single-line block comment can easily serve as a substitute.
| However, there is no straightforward workaround for the lack
| of nested block comments.
|
| > I've never heard someone refer to C as "expressive", but
| maybe it was in 1972 when compared to assembly.
|
| I was thinking of Fortran in this context. For instance,
| Fortran 77 lacked function pointers and offered a limited set
| of control flow structures, along with cumbersome support for
| recursion. I know Fortran, with its native support for
| multidimensional arrays, excelled in numerical and scientific
| computing but C quickly became the preferred language for
| general purpose computing.
|
| While very few today would consider C a pinnacle of
| expressiveness, when I was learning C, the landscape of
| mainstream programming languages was much more restricted. In
| fact, the preface to the first edition of K&R notes the
| following:
|
| _" In our experience, C has proven to be a pleasant,
| expressive and versatile language for a wide variety of
| programs."_
|
| C, Pascal, etc. stood out as some of the few mainstream
| programming languages that offered a reasonable level of
| expressiveness. Of course, Lisp was exceptionally expressive
| in its own right, but it wasn't always the best fit for
| certain applications or environments.
|
| > And what bearing does the comment syntax have on the
| expressiveness of a language?
|
| Nothing at all. I agree. The expressiveness of C comes from
| its grammar, which the language parser handles. Support for
| nested comments, in the context of C, is a concern for the
| lexer, so indeed one does not directly influence the other.
| However, it is still curious that a language with such a
| sophisticated grammar and parser could not allocate a bit of
| its complexity budget to support nested comments in its
| lexer. This is a trivial matter, I know, but I still couldn't
| help but wonder about it.
| dahart wrote:
| Fair enough. From my perspective, lack of single line
| comments is a little surprising because most other
| languages had it at the time (1973, when ML was
| introduced). Lack of nested comments doesn't seem
| surprising, because it isn't an important feature for a
| language, and because most other languages did not have it
| at the time (1972, when C was introduced).
|
| I can imagine both pro and con arguments for supporting
| nested comments, but regardless of what I think, C
| certainly could have added support for nested comments at
| any time, and hasn't, which suggests that there isn't
| sufficient need for it. That might be the entire
| explanation: not even worth a little complexity.
| masfuerte wrote:
| AFAIK, C didn't get single line comments until C99. They
| were a C++ feature originally.
| dahart wrote:
| Oh wow, I didn't remember that, and I did start writing C
| before 99. I stand corrected. I guess that is a little
| surprising. ;)
|
| Is true that many languages had single line comments?
| Maybe I'm forgetting more, but I remember everything else
| having single line comments... asm, basic, shell. I used
| Pascal in the 80s and apparently forgot it didn't have
| line comments either?
| masfuerte wrote:
| That's my recollection, that most languages had single
| line comments. Some had multi-line comments but C++ is
| the first I remember having syntaxes for both. That said,
| I'm not terribly familiar with pre-80s stuff.
| quietbritishjim wrote:
| Some C compilers supported it as an unofficial extension
| well before C99, so that could be why you didn't realise
| or don't remember. I think that included both Visual
| Studio (which was really a C++ compiler that could turn
| off the C++ bits) and GCC with GNU extensions enabled.
| susam wrote:
| > C certainly could have added support for nested
| comments at any time
|
| After C89 was ratified, adding nested comments to C would
| have risked breaking existing code. For instance, this is
| a valid program in C89:
|         #include <stdio.h>
|         int main() {
|             /* /* Comment */
|             printf("hello */ world");
|             return 0;
|         }
|
| However, if a later C standard were to introduce nested
| comments, it would break the above program because then
| the following part of the program would be recognised as
| a comment:
|             /* /* Comment */
|             printf("hello */
|
| The above text would be ignored. Then the compiler would
| encounter the following:
|             world");
|
| This would lead to errors like _undeclared identifier
| 'world'_, _missing terminating " character_, etc.
| dahart wrote:
| Given the neighboring thread where I just learned that
| the lexer runs before the preprocessor, I'm not sure that
| would be the outcome. There's no reason to assume the
| comment terminator wouldn't be ignored in strings. And
| even today, you can safely write printf("hello //
| world\n"); without risking a compile error, right?
| susam wrote:
| > Given the neighboring thread where I just learned that
| the lexer runs before the preprocessor, I'm not sure that
| would be the outcome.
|
| That is precisely why nested comments would end up
| breaking the C89 code example I provided above. I
| elaborate on this further below.
|
| > There's no reason to assume the comment terminator
| wouldn't be ignored in strings.
|
| There is no notion of "comment terminator in strings" in
| C. At any point of time, the lexer is reading either a
| string or a comment but never one within the other. For
| example, in C89, C99, etc., this is an invalid C program
| too:
|             #include <stdio.h>
|             int main()
|             {
|                 /* Comment
|                 printf("hello */ world");
|                 return 0;
|             }
|
| In this case, we wouldn't say that the lexer is "honoring
| the comment terminator in a string" because, at the point
| the comment terminator '*/' is read, there is no active
| string. There is only a comment that looks like this:
| /* Comment printf("hello */
|
| The double quotation mark within the comment is
| immaterial. It is simply part of the comment. Once the
| lexer has read the opening '/*', it looks for the
| terminating '*/'. This behaviour would hold even if
| future C standards were to allow nested comments, which
| is why nested comments would break the C89 example I
| mentioned in my earlier HN comment.
|
| > And even today, you can safely write printf("hello //
| world\n"); without risking a compile error, right?
|
| Right. But it is not clear what this has got to do with
| my concern that nested comments would break valid C89
| programs. In this printf() example, we only have an
| ordinary string, so obviously this compiles fine. Once
| the lexer has read the opening quotation mark as the
| beginning of a string, it looks for an unescaped
| terminating quotation mark. So clearly, everything until
| the unescaped terminating quotation mark is a string!
| pklausler wrote:
| > Fortran 77 lacked function pointers
|
| But we did have dummy procedures, which covered one of the
| important use cases directly, and which could be abused to
| fake function/subroutine pointers stored in data.
| michaelcampbell wrote:
| I was barely too young for this to make much of an impact
| at the time (though I'm older than many, perhaps most,
| here). I understand why C was considered a "high level
| language", but it still hits me weird, given today's
| context.
| gsliepen wrote:
| Well there is one way to nest comments in C, and that's by
| using #if 0:
|             #if 0
|             This is a
|             #if 0
|             nested comment!
|             #endif
|             #endif
| fanf2 wrote:
| Except that text inside #if 0 still has to lex correctly.
|
| (unifdef has some evil code to support using C-style
| preprocessor directives with non-C source, which mostly boils
| down to ignoring comments. I don't recommend it!)
| dahart wrote:
| > Except that text inside #if 0 still has to lex correctly.
|
| Are you sure? I just tried on godbolt and that's not true
| with gcc 14.2. I've definitely put syntax errors
| intentionally into #if 0 blocks and had it compile. Are you
| thinking of some older version or something? I thought the
| pre-processor ran before the lexer since always...
| fanf2 wrote:
| There are three (relevant) phases (see "translation
| phases" in section 5 of the standard):
|
| * program is lexed into preprocessing tokens; comments
| turn into whitespace
|
| * preprocessor does its thing
|
| * preprocessor tokens are turned into proper tokens;
| different kinds of number are disambiguated; keywords and
| identifiers are disambiguated
|
| If you put an unclosed comment inside #if 0 then it won't
| work as you might expect.
| dahart wrote:
| Ah, I see. You're right!
| kragen wrote:
| This is not just true of Standard ML; it's also true of regular
| ML.
| layer8 wrote:
| Lexing nested comments requires maintaining a stack (or at
| least a nesting-level counter). That wasn't traditionally seen
| as being within the realm of lexical analysis, which would only
| use a finite-state automaton, like regular expressions.
| akira2501 wrote:
| Pascal always supported the same nested comment syntax as your
| example.
| lupire wrote:
| > You'll notice its hash function only needs to consider a single
| character in a string. That's what makes it perfect,
|
| Is that a joke?
|
| https://en.m.wikipedia.org/wiki/Perfect_hash_function
| jaen wrote:
| No. Taking the value of a single character is a correct perfect
| hash function, assuming there exists a position for the input
| string set where all characters differ.
| playingalong wrote:
| Nice read.
|
| I guess the article could be called Falsehoods Programmers Assume
| of Programming Language Syntaxes.
| TomatoCo wrote:
| I think my favorite C trigraph was something like
| do_action() ??!??! handle_error()
|
| It almost looks like special error handling syntax but still
| remains satisfying once you realize it's an || logical-or
| expression using short-circuiting rules: handle_error() only
| runs if the action returns zero (falsy).
| wslh wrote:
| Did you choose the legacy C trigraphs over || for aesthetic
| purposes?
| wslh wrote:
| Could you review my comment on HN? Please educate me if there
| is something I haven't understood, rather than downvoting my
| question.
| samatman wrote:
| The grandparent post is specifically about trigraphs.
| Saying something about trigraphs was the end-in-itself,
| trigraphs were chosen to illustrate something about
| trigraphs. So your question made no sense. Hope that helps.
| Izkata wrote:
| Maybe the confusion was the other way, more like "why is
| that funny/interesting?"
|
| An attempt to answer that: In English, mixing ?! at the
| end of a question is a way of indicating bewilderment.
| Like "What was that?!"
| wslh wrote:
| My question was precisely about why the user like
| trigraphs over using just || on this case. It is a very
| clear question and makes all the sense.
| teo_zero wrote:
| I didn't downvote your comment but understand why it
| looks "wrong": it's like, in a thread on English
| oddities, you replied to someone bringing up the "buffalo
| buffalo buffalo" example with the question "why are you
| so fond of bovines"?
| wslh wrote:
| It has nothing to do with that. I could ask why he didn't
| choose a different homonymic ambiguity [1].
|
| [1] https://journals.linguisticsociety.org/proceedings/in
| dex.php...
| kergonath wrote:
| The post shows a "favorite C trigraph" thing, not that
| they were going out of their way to use trigraphs in
| actual code or that you should. Using trigraphs is the
| whole premise so no, your question makes no sense in that
| context.
|
| FWIW the ??!??! double trigraph as error processing is
| funny because of the meaning of ?! and various
| combinations of ? and !. It is funny and it has
| trigraphs. That's the whole point.
| mstade wrote:
| My reading of the downvoted question was one of genuine
| curiosity of why the author chose that as a favorite
| trigraph, as in "why that one instead of another", not as
| criticism of the choice of trigraph over something more
| conventional. I may be wrong of course, but it didn't
| seem like a particularly malicious question to me and
| your rationale unfortunately doesn't convince me
| otherwise. Not that it has to, this is all very
| subjective after all, but just offering up a counter
| opinion.
|
| I gave the question a +1 because I, as previously stated,
| read it to be genuine curiosity. Maybe a smiley would've
| helped, I don't know. ¯\_(ツ)_/¯
| wslh wrote:
| > The post shows a "favorite C trigraph" thing, not that
| they were going out of their way to use trigraphs in
| actual code or that you should.
|
| But I am free to be curious and ask the author why he
| chose it! We are not computers but human beings! There is
| no HN rule that says I cannot be curious and ask a
| question that arose from a thread even if it is not
| directly connected to it! [1].
|
| [1] https://www.iflscience.com/charles-babbage-once-sent-
| the-mos...
| JadeNB wrote:
| > So your question made no sense. Hope that helps.
|
| I think that is uncharitable. The question ("Did you
| choose the legacy C trigraphs over || for aesthetic
| purposes?") makes perfect sense to me. I think context
| makes it reasonably clear that the answer is 'yes,' but
| that doesn't mean that the question doesn't make sense,
| only perhaps that it didn't need to be asked.
| James_K wrote:
| Easiest way to get downvotes is to ask people not to give
| them. You just gotta ignore the haters.
| jacobn wrote:
| https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(progra...
|
| ??! is converted to |
|
| So ??!??! becomes ||, i.e. "or"
| IshKebab wrote:
| I don't understand why you wouldn't use Tree Sitter's syntax
| highlighting for this. I mean it's not going to be as fast but
| that clearly isn't an issue here.
|
| Is this a "no third party dependencies" thing?
| jart wrote:
| I don't want to require everyone who builds llamafile from
| source need to install rust. I don't even require that people
| install the gperf command, since I can build gperf as a 700kb
| actually portable executable and vendor it in the repo. Tree
| sitter I'd imagine does a really great highly precise job with
| the languages it supports. However it appears to support fewer
| of them than I am currently. I'm taking a breadth first
| approach to syntax highlighting, due to the enormity of
| languages LLMs understand.
| IshKebab wrote:
| I think the Rust component of tree-sitter-highlight is
| actually pretty small (Tree Sitter generates C for the actual
| parser).
|
| But fair enough - fewer dependencies is always nice,
| especially in C++ (which doesn't have a modern package
| manager) and in ML where an enormous janky Python
| installation is apparently a perfectly normal thing to
| require.
| mdaniel wrote:
| I somehow thought Conan[1] was the C++ package manager;
| it's at least partially supported by GitLab, for what
| that's worth
|
| 1: https://docs.conan.io/2/introduction.html
| IshKebab wrote:
| No, if anything vcpkg is "the C++ package manager", but
| it's nowhere near pervasive and easy-to-use enough to
| come close to even Pip. It's leagues away from Cargo, Go,
| and other _actually good_ PL package managers.
| mdaniel wrote:
| I knew that Microsoft used that on Windows but had no
| idea it was multi-platform: https://github.com/microsoft/
| vcpkg/releases/tag/2024.10.21 _(MIT, like a lot of their
| stuff)_
|
| Microsoft is such an odd duck, sometimes, but I'm glad to
| take advantage of their "good years" while it lasts
| chubot wrote:
| Have you developed against TreeSitter? Some feedback from
| people who use it here -
| https://news.ycombinator.com/item?id=39783471
|
| And here -
| https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing...
| IshKebab wrote:
| Yes I have, and it worked very well for what I was using it
| for (assembly language LSP server). I didn't run into any of
| the issues they mentioned (not saying they don't exist
| though).
|
| For new projects I use Chumsky. It's a pure Rust parser which
| is nice because it means you avoid the generated C, and it
| also gives you a fully parsed and natively typed output,
| rather than Tree Sitter's dynamically typed tree of nodes,
| which means there's no extra parsing step to do.
|
| The main downside is it's more complicated to write the
| parser (some fairly extreme types). The API isn't stable yet
| either. But overall I like it more than Tree Sitter.
| jim_lawless wrote:
| Forth has a default syntax, but Forth code can execute during the
| compilation process allowing it to accept/compile custom
| syntaxes.
| SonOfLilit wrote:
| Justine gets very close to the hairiest parsing issue in any
| language without encountering it:
|
| Perl's syntax is undecidable, because the difference between
| treating some characters as a comment or as a regex can depend on
| the type of a variable that is only determined e.g. based on
| whether a search for a Collatz counterexample terminates, or
| just, you know, user input.
|
| https://perlmonks.org/?node_id=663393
|
| C++ templates have a similar issue, I think.
| fanf2 wrote:
| I think possibly the most hilariously complicated instance of
| this is in perl's tokenizer, toke.c (which starts with a
| Tolkien quote, 'It all comes from here, the stench and the
| peril.' -- Frodo).
|
| There's a function called intuit_more which works out if
| $var[stuff] inside a regex is a variable interpolation followed
| by a character class, or an array element interpolation. Its
| result can depend on whether something in the stuff has been
| declared as a variable or not.
|
| But even if you ignore the undecidability, the rest is still
| ridiculously complicated.
|
| https://github.com/Perl/perl5/blob/blead/toke.c#L4502
| ufo wrote:
| Wow. I wonder how that function came to be in the first
| place. Surely it couldn't have started out that complicated?
| swolchok wrote:
| > C++ templates have a similar issue
|
| TIL! I went and dug up a citation:
| https://blog.reverberate.org/2013/08/parsing-c-is-literally-...
| layer8 wrote:
| How could a search for a Collatz counterexample possibly
| terminate? ;)
| chubot wrote:
| Yup, bash and GNU Make have the same issue as Perl does, and I
| mention the C++ issue here too:
|
| _Parsing Bash is Undecidable_ -
| https://www.oilshell.org/blog/2016/10/20.html
|
| I remember a talk from Larry Wall on Perl 6 (now Raku), where
| he says this type of thing is a mistake. Raku can be statically
| parsed, as far as I know.
| jwilk wrote:
| Parsing POSIX shell is undecidable too:
|
| https://news.ycombinator.com/item?id=30362718
| chubot wrote:
| Yes, good point -- aliases makes parse time depend on
| runtime. That is mentioned in
|
| _Morbig: A static parser for POSIX shell_ - https://schola
| r.google.com/scholar?cluster=15754961728999604...
|
| (at the time I wrote the post about bash, I hadn't
| implemented aliases yet)
|
| But it's a little different since it is an intentional
| feature, not an accident. It's designed to literally
| reinvoke the parser at runtime. I think it's not that
| good/useful a feature, and I tend to avoid it, but many
| people use it.
| petesergeant wrote:
| > Perl also has this goofy convention for writing man pages in
| your source code
|
| The world corpus of software would be much better documented if
| everywhere else had stolen this from Perl. Inline POD is great.
| kragen wrote:
| Perl and Python stole it from Emacs Lisp, though Perl took it
| further. I'm not sure where Java stole it from, but nowadays
| Doxygen is pretty common for C code. Unfortunately this results
| in people thinking that Javadoc and Doxygen are substitutes for
| actual documentation like the Emacs Lisp Reference Manual,
| which cannot be generated from docstrings, because the
| organization of the source code is hopelessly inadequate for a
| reference manual.
| mdaniel wrote:
| > Emacs Lisp Reference Manual, which cannot be generated from
| docstrings, because the organization of the source code is
| hopelessly inadequate for a reference manual.
|
| Well, they're not doing themselves any favors by just willy
| nilly mixing C with "user-facing" defuns <https://emba.gnu.or
| g/emacs/emacs/-/blob/ed1d691184df4b50da6b...>. I was curious
| if they could benefit from "literate programming" since
| OrgMode is _the bee's knees_ but not with that style coding
| they can't
| kragen wrote:
| I didn't mean that specifically the Emacs source code was
| not organized in the right way for a reference manual. I
| meant that C and Java source code in general isn't. And
| C++, which is actually where people use Doxygen more.
|
| The Python standard library manual is also exemplary, and
| also necessarily organized differently from the source
| code.
| mdaniel wrote:
| > The Python standard library manual is also exemplary
|
| Maybe parts of it are, but as a concrete example
| https://docs.python.org/3/library/re.html#re.match is
| just some YOLO about what, _specifically_ , is the first
| argument to re.match: string, or compiled expression?
| Well, it's both! Huzzah! I guess they get points for
| consistency because the first argument to re.compile is
| also "both"
|
| But, any idea what type re.compile returns? cause
| https://docs.python.org/3/library/re.html#re.compile is
| all "don't you worry about it" versus its re.match friend
| who goes out of their way to state that it is an re.Match
| object
|
| Would it have been so hard to actually state it, versus
| requiring someone to invoke type() to get <class
| 're.Pattern'>?
| kragen wrote:
| I'm surprised to see that it's allowed to pass a compiled
| expression to re.match, since the regular expression
| object has a .match method of its own. To me the fact
| that the argument is called _pattern_ implies that it's
| a string, because at the beginning of that chapter, it
| says, "Both patterns and strings to be searched can be
| Unicode strings ( _str_ ) as well as 8-bit strings (
| _bytes_ ). (...) Usually patterns will be expressed in
| Python code using this raw string notation."
|
| But this ability to pass a compiled regexp rather than a
| string can't have been an accidental feature, so I don't
| know why it isn't documented.
|
| Probably it would be good to have an example of invoking
| re.match with a literal string in the documentation item
| for re.match that you linked. There are sixteen such
| examples in the chapter, the first being re.match(r"(\w+)
| (\w+)", "Isaac Newton, physicist"), so you aren't going
| to be able to read much of the chapter without figuring
| out that you can pass a string there, but all sixteen of
| them come after that section. A useful example might be:
| >>> [s for s in ["", " ", "a ", " a", "aa"] if
| re.match(r'\w', s)] ['a ', 'aa']
|
| It's easy to make manuals worse by adding too much text
| to them, but in this case I think a small example like
| that would be an improvement.
|
| As for what type re.compile returns, the section you
| linked to says, "Compile a regular expression pattern
| into a regular expression object, which can be used for
| matching using its match(), search() and other methods,
| described below." Is your criticism that it doesn't
| explicitly say that the regular expression object is
| _returned_ (as opposed to, I suppose, stored in a table
| somewhere), or that it says "a regular expression
| object" instead of saying "an re.Pattern object"? Because
| the words "regular expression object" are a link to the
| "Regular Expression Objects" section, which begins by
| saying, "class re.Pattern: Compiled regular expression
| object returned by re.compile()." To me the name of the
| class doesn't seem like it adds much value here--to write
| programs that work using the re module, you don't need to
| know the name of the class the regular expression objects
| belong to, just what interface they support.
|
| (It's unfortunate that the class name is documented,
| because it would be better to rename it to a term that
| wasn't already defined to mean "a string that can be
| compiled to a regular expression object"!)
|
| But possibly I've been using the re module long enough
| that I'm blind to the deficiencies in its documentation?
|
| Anyway, I think documentation questions like this, about
| gradual introduction, forward references, sequencing,
| publicly documented (and thus stable) versus internal-
| only names, etc., are hard to reconcile with the
| constraints of source code, impossible in most languages.
| In this case the source code is divided between Python
| and C, adding difficulty.
| metadat wrote:
| _> The languages I decided to support are Ada, Assembly, BASIC,
| C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML,
| Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make,
| Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust,
| Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig._
|
| A few (admittedly silly) questions about the list:
|
| 1. Why no Erlang, Elixir, or Crystal?
|
| Erlang appears to be just at the author's boundary at #47 on the
| TIOBE index. https://www.tiobe.com/tiobe-index/
|
| 2. What is _"Shell"_? Sh, Bash, Zsh, Windows Cmd, PowerShell..?
|
| 3. Perl but no Awk? Curious why, because Awk is a similar but
| comparatively trivial language. Widely used, too.
|
| To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet
| m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.
|
| What's the methodology / criteria? Only things the author is
| already familiar with?
|
| Still a fun article.
| Yasuraka wrote:
| Tiobes's index is quite literally worthless, especially with
| regards to its stated purpose, let alone as a general point of
| orientation.
|
| I wish that people would stop lending it any credibility.
| Kwpolska wrote:
| "Shell" in the context of a syntax highlighting language picker
| almost always means a Unixy shell, most likely something along
| the lines of Bash.
| dakiol wrote:
| Wouldn't it be possible to let the LLM do the highlighting? Instead
| of returning code in plain text, it could return code within html
| with the appropriate tags. Maybe it's harder than it sounds...
| but if it's just for highlighting the code the LLM returns, I
| wouldn't mind the highlighting not being 100% accurate.
| trashburger wrote:
| Would be much slower and eat up precious context window.
| layer8 wrote:
| The author may have missed that lexing C is actually context-
| sensitive, i.e. you need a symbol table:
| https://en.wikipedia.org/wiki/Lexer_hack
|
| Of course, for syntax highlighting this is only relevant if you
| want to highlight the multiplication operator differently from
| the dereferencing operator, or declarations differently from
| expressions.
|
| More generally, however, I find it useful to highlight (say)
| types differently from variables or functions, which in some
| (most?) popular languages requires full parsing and symbol table
| information. Some IDEs therefore implement two levels of syntax
| highlighting, a basic one that only requires lexical information,
| and an extended one that kicks in when full grammar and type
| information becomes available.
| legobmw99 wrote:
| I'd be shocked if jart didn't know this, but it seems unlikely
| that an LLM would generate one of these most vexing parses,
| unless explicitly asked
| layer8 wrote:
| Given all the things that were new to the author in the
| article, I wouldn't be shocked at all. There's just a huge
| number of things to know, or to have come across.
| jraph wrote:
| Justine is proficient in C, she is the author of a libc
| (cosmopolitan) among other things, like Actually Portable
| Executables [1].
|
| I would expect her to know C quite well, and that's
| probably an understatement.
|
| [1] https://justine.lol/ape.html
| quietbritishjim wrote:
| I think you're thinking of something different to the issue
| in the parent comment. The most vexing parse is, as the name
| suggests, a problem at the parsing stage rather than the
| earlier lexing phase. Unlike the referenced lexing problem,
| it doesn't require any hack for compilers to deal with it.
| That's because it's not really a problem for the compiler;
| it's humans that find it surprising.
| alekratz wrote:
| I don't think the lexer hack is relevant in this instance. The
| lexer hack just refers to the ambiguity of `A * B` and whether
| that should be parsed as a variable declaration or an
| expression. If you're building a syntax tree, then this
| matters, but AFAICT all the author needs is a sequence of
| tokens and not a syntax tree. Maybe "parser hack" would be a
| better name for it.
| teo_zero wrote:
| > this is only relevant if you want to highlight the
| multiplication operator differently from the dereferencing
| operator
|
| Can you mention one editor which does that?
| quietbritishjim wrote:
| I don't think they implied there is. The sentence you quoted
| is essentially "this is relevant for their article about
| weird lexical syntax, but (almost definitely) not relevant to
| their original problem of syntax highlighting".
| mdaniel wrote:
| I could be stretching the definition of "does" but the
| newfound(?) tree-sitter support in Emacs[1] I believe would
| allow that since it for sure understands the distinction but
| I don't possess enough font-lock ninjary to actually, for
| real, bind a different color to the distinct usages
|             /* given foo.c */
|             int main() {
|                 int a, *b;
|                 a = 5 * 10;
|                 b = &a;
|                 printf("a is %d\n", *b);
|             }
|
| and then M-x c-ts-mode followed by navigating to each * and
| invoking M-x treesit-inspect-node-at-point in turn produces,
| respectively:
|             (declaration declarator: (pointer_declarator "*"))
|             right: (binary_expression operator: "*")
|             arguments: (argument_list (pointer_expression operator: "*"))
|
| 1: https://www.emacswiki.org/emacs/Tree-sitter
| teo_zero wrote:
| These examples are unambiguous. Try with something more
| spicy like:
|             return (A)*(B);
|
| which depends on A being a type or a variable.
| dummy7777 wrote:
| hey
| murkt wrote:
| Author hasn't tried to highlight TeX. Which is good for their
| mental health, I suppose, as it's generally impossible to fully
| highlight TeX without interpreting it.
|
| Even parsing is not enough, as it's possible to redefine what
| each character does. You can make it do things like "and now K
| means { and C means }".
|
| Yes, you can find papers on arXiv that use this god-forsaken
| feature.
| jart wrote:
| I wrote https://github.com/Mozilla-
| Ocho/llamafile/blob/main/llamafil... and it does a reasonable
| job highlighting without breaking for all the .tex files I
| could find on my hard drive. My goal is to hopefully cover
| 99.9% of real world usage, since that'll likely cover
| everything an LLM might output. Esoteric syntax also usually
| isn't a problem, so long as it doesn't cause strings and
| comments to extend forever, eclipsing the rest of the source
| code in a file.
| murkt wrote:
| Yes, when the goal isn't to support 100% of all the weird stuff,
| then it's orders of magnitude easier!
| nathell wrote:
| Same with Common Lisp (you can redefine the read table),
| although that's likely abused less often on arXiv.
| bobbylarrybobby wrote:
| I couldn't believe it when I learned that \makeatletter does
| not "make (something) at a letter (character)" but rather
| "treats the '@' character as a letter when parsing".
| xonix wrote:
| No AWK?
| sundarurfriend wrote:
| The final line number count is missing Julia. Based on the file
| in the repo, it would be at the bottom of the first column:
| between ld and R.
|
| Among the niceties listed here, the one I'd wish for Julia to
| have would be C#'s "However many quotes you put on the lefthand
| side, that's what'll be used to terminate the string at the other
| end". Documentation that talks about quoting would be so much
| easier to read (in source form) with something like that.
| sundarurfriend wrote:
| One nicety that Julia does have that I didn't know about (or
| had forgotten) is nested multi-line comments.
| #= this one has a #= nested comment =#
| inside of it and that works fine! =#
| nusaru wrote:
| > Ruby is the union of all earlier languages, and it's not even
| formally documented.
|
| It's documented, but you need $250 to spare:
| https://www.iso.org/standard/59579.html
| mdaniel wrote:
| Well, according to (ahem) _a copy_ that I found, it only goes
| up to MRI 1.9 and goes out of its way to say "welp, the world
| is changing, so we're just going to punt until Ruby stabilizes"
| which is damn cheating for a _standard_ IMHO
|
| Also, while doing some digging I found there actually are a
| number of the standards that are legitimately publicly
| available
| https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
| vidarh wrote:
| ISO Ruby is a tiny, dated subset of Ruby. I doubt you'll find
| much Ruby that conforms to it.
|
| The Ruby everyone uses is much better defined by RubySpec etc.
| via test cases, but that's not complete either.
| tomcam wrote:
| > If you ever want to confuse your coworkers, then one great way
| to abuse this syntax is by replacing the heredoc marker with an
| empty string
|
| Maybe I am in favor of the death penalty after all
| petters wrote:
| > I'm not sure who wants to be able to syntax highlight C at 35
| MB per second, but I am now able to do so
|
| Fast, but tcc *compiles* C to binary code at 29 MB/s on a really
| old computer: https://bellard.org/tcc/#speed Should be possible
| to go much faster but probably not needed
| BiteCode_dev wrote:
| Justine vs Bellard, that's a nice setup.
| transfire wrote:
| Impressive work!
|
| I am surprised Smalltalk and Prolog are in there though.
| fahrnfahrnfahrn wrote:
| While developing the syntax for a programming language in the
| early 80s, I discovered that allowing spaces in identifiers was
| unambiguous, e.g., upper left corner = scale factor * old upper
| left corner.
___________________________________________________________________
(page generated 2024-11-03 23:01 UTC)