[HN Gopher] Weird Lexical Syntax
___________________________________________________________________
Weird Lexical Syntax
Author : jart
Score : 304 points
Date : 2024-11-02 07:45 UTC (15 hours ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| llm_trw wrote:
| I've done a fair bit of forth and I've not seen c" used. The
| usual string printing operator is ." .
| mananaysiempre wrote:
| Counted ("Pascal") strings are rare nowadays so C" is not often
| used. Its addr len equivalent is S" and that one is fairly
| common in string manipulation code.
| kragen wrote:
| Right, _c"_ is for when you want to pass a literal string to
| some other word, not print it. But I agree that it's not very
| common, because you normally use _s"_ for that, which leaves
| the address and length on the stack, while _c"_ leaves just an
| address on the stack, pointing to a one-byte count field
| followed by the bytes. I think adding _c"_ in Forth-83 (and
| renaming _"_ to _s"_) was a mistake, and it would have been
| better to deprecate the standard words that expect or produce
| such counted strings, other than _count_ itself. See
| https://forth-standard.org/standard/alpha,
| https://forth-standard.org/standard/core/Cq,
| https://forth-standard.org/standard/core/COUNT, and
| https://forth-standard.org/standard/core/Sq.
|
| You can easily add new string and comment syntaxes to Forth,
| though. For example, you can add BCPL-style // comments to end
| of line with this line of code in, I believe, all standard
| Forths, though I've only tested it in GForth:
| : // 10 word drop ; immediate
|
| Getting it to work in block files requires more work but is
| still only a few lines of code. The standard word _\_ does
| this, and _see \_ decompiles the GForth implementation as
| : \ blk @ IF >in @ c/l / 1+ c/l * >in !
| EXIT THEN source >in ! drop ; immediate
|
| This kind of thing was commonly done for text editor commands,
| for example; you might define _i_ as a word that reads text
| until the end of the line and inserts it at the current
| position in the editor, rather than discarding it like my //
| above. Among other things, the screen editor in F83 does
| exactly that.
|
| So, as with Perl, PostScript, TeX, m4, and Lisps that support
| readmacros, you can't lex Forth without executing it.
| skrebbel wrote:
| This was a delightful read, thanks!
| croisillon wrote:
| Glad to see confirmed that PHP is the most non weird programming
| language ;)
| rererereferred wrote:
| I recently learned php's heredoc can have space before it and
| it will remove those spaces from the lines in the string:
|     $a = <<<EOL
|         This is not indented
|             but this has 4 spaces of indentation
|         EOL;
|
| But the spaces have to match: if any line has fewer spaces than
| the closing EOL, it gives an error.
| alganet wrote:
| There are two types of languages: the ones full of quirks and
| the ones no one uses.
| skitter wrote:
| Another syntax oddity (not mentioned here) that breaks most
| highlighters: In Java, unicode escapes can be anywhere, not just
| in strings. For example, the following is a valid class:
| class Foo\u007b}
|
| and this assert will not trigger:
|     assert
|         // String literals can have unicode escapes like \u000A!
|         "Hello World".equals("\u00E4");
| ivanjermakov wrote:
| I have never seen this in Java! Is there any use cases where it
| could be useful?
| susam wrote:
| I don't know about usefulness but it does let us write
| identifiers using Unicode characters. For example:
|     public class Foo {
|         public static void main(String[] args) {
|             double \u03c0 = 3.14159265;
|             System.out.println("\u03c0 = " + \u03c0);
|         }
|     }
|
| Output:
|     $ javac Foo.java && java Foo
|     π = 3.14159265
|
| Of course, nowadays we can simply write this with any decent
| editor:
|     public class Foo {
|         public static void main(String[] args) {
|             double π = 3.14159265;
|             System.out.println("π = " + π);
|         }
|     }
|
| Support for Unicode escape sequences is a result of how the
| Java Language Specification (JLS) defines InputCharacter.
| Quoting from Section 3.4 of JLS
| <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:
|     InputCharacter:
|         UnicodeInputCharacter but not CR or LF
|
| UnicodeInputCharacter is defined as the following in section
| 3.3:
|     UnicodeInputCharacter:
|         UnicodeEscape
|         RawInputCharacter
|
|     UnicodeEscape:
|         \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
|
|     UnicodeMarker:
|         u {u}
|
|     HexDigit:
|         (one of)
|         0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
|
|     RawInputCharacter:
|         any Unicode character
|
| As a result the lexical analyser honours Unicode escape
| sequences absolutely anywhere in the program text. For
| example, this is a valid Java program:
|     public class Bar {
|         public static void \u006d\u0061\u0069\u006e(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the output:
|     $ javac Bar.java && java Bar
|     hello, world
|
| However, this is an incorrect Java program:
|     public class Baz {
|         // This comment contains \u6d.
|         public static void main(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the error:
|     $ javac Baz.java
|     Baz.java:2: error: illegal unicode escape
|         // This comment contains \u6d.
|                                     ^
|     1 error
|
| Yes, this is an error even if the illegal Unicode escape
| sequence occurs in a comment!
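|
| To make the translation step concrete, here is a minimal Python
| sketch of the pass described in JLS section 3.3. It is an
| illustration under stated assumptions, not javac's actual code:
| the name translate_unicode_escapes is made up, and the error
| message only imitates the javac text quoted above. The pass runs
| before any tokenizing, which is why comments and identifiers are
| treated the same way.
|
```python
import string


def translate_unicode_escapes(source: str) -> str:
    """Translate \\uXXXX escapes everywhere in `source` (sketch of JLS 3.3)."""
    out = []
    i = 0
    n = len(source)
    backslashes = 0  # contiguous raw backslashes immediately before index i
    while i < n:
        ch = source[i]
        # A backslash starts an escape only if an even number of raw
        # backslashes precede it and the next character is 'u'.
        if ch == "\\" and backslashes % 2 == 0 and source[i + 1:i + 2] == "u":
            j = i + 1
            while j < n and source[j] == "u":  # UnicodeMarker: u {u}
                j += 1
            digits = source[j:j + 4]
            if len(digits) != 4 or any(d not in string.hexdigits for d in digits):
                raise SyntaxError(f"illegal unicode escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i = j + 4
            backslashes = 0
            continue
        backslashes = backslashes + 1 if ch == "\\" else 0
        out.append(ch)
        i += 1
    return "".join(out)


print(translate_unicode_escapes(r"class Foo\u007b}"))                 # class Foo{}
print(translate_unicode_escapes(r"void \u006d\u0061\u0069\u006e()"))  # void main()
```
|
| Feeding it the Baz comment raises SyntaxError, because the pass
| has no idea yet that the text is inside a comment.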
| ivanjermakov wrote:
| I wonder if full unicode range was accepted because some
| companies are writing code in non-english.
| layer8 wrote:
| Javac uses the platform encoding [0] by default to interpret
| Java source files. This means that Java source code files are
| inherently non-portable. When Java was first developed (and
| for a long time after), this was the default situation for
| any kind of plain text files. The escape sequence syntax
| makes it possible to transform [1] Java source code into a portable
| (that is, ASCII-only) representation that is completely
| equivalent to the original, and also to convert it back to
| any platform encoding.
|
| Source control clients could apply this automatically upon
| checkin/checkout, so that clients with different platform
| encodings can work together. Alternatively, IDEs could do
| this when saving/loading Java source files. That never quite
| caught on, and the general advice was to stick to ASCII, at
| least outside comments.
|
| [0] Since JDK 18, the default encoding defaults to UTF-8.
| This probably also extends to _javac_ , though I haven't
| verified it.
|
| [1] https://docs.oracle.com/javase/8/docs/technotes/tools/win
| dow...
| mistercow wrote:
| I also argue that failing to syntax highlight this correctly is
| a security issue. You can terminate block comments with Unicode
| escapes, so if you wanted to hide some malicious code in a Java
| source file, you just need an excuse for there to be a block of
| Unicode escapes in a comment. A dev who doesn't know about this
| quirk is likely to just skip over it, assuming it's commented
| out.
| mcphage wrote:
| At one point there was an open source project to formally specify
| Ruby, but I don't know if it's still alive:
| https://github.com/ruby/spec
|
| Hmm, it seems to be alive, but based more on behavior than
| syntax.
| keybored wrote:
| Meanwhile NeoVim doesn't syntax highlight my commit message
| properly if I have messed with "commit cleanup" enough.
|
| The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and the
| id starts with `#`. One suggestion was to allow backslash escapes
| in commit messages since that makes sense to a computer
| scientist.[1]
|
| But looking at all of this lexical stuff I wonder if makes-sense-
| to-computer-scientist is a good goal. They invented the problem
| of using a uniform delimiter for strings and then had to solve
| their own problem. Maybe it was hard to use backtick in the 70's
| and 80's, but today[2] you could use backtick to start a string
| and a single quote to end it.
|
| What do C-like programming languages use single quotes for? To
| quote characters. Why do you need to quote characters? I've never
| seen a literal character which needed an "end character" marker.
|
| Raw strings would still be useful but you wouldn't need raw
| strings just to do a very basic thing like make a string which
| has typewriter quotes in it.
|
| Of course this was for C-like languages. Don't even get me
| started on shell and related languages where basically everything
| is a string and you have to make a single-quote/double-quote
| battle plan before doing anything slightly nested.
|
| [1] https://lore.kernel.org/git/vpq3808p40o.fsf@anie.imag.fr/
|
| [2] Notwithstanding us Europeans that use a dead-key keyboard
| layout where you have to type twice to get one measly backtick
| (not that I use those)
| pwdisswordfishz wrote:
| > The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and
| the id starts with `#`
|
| https://git-scm.com/docs/git-commit#Documentation/git-commit...
| keybored wrote:
| See "commit cleanup".
|
| There are surprising layers to this. That the reporter in that
| thread says that git-commit will "happily" accept `#` in
| commit messages is half-true: it will accept it if you don't
| edit the message since the `default` cleanup (that you linked
| to) will not remove comments if the message is given through
| things like `-m` and not an editing session. So `git commit
| -m'#something'` is fine. But then try to do rebase and
| cherry-pick and whatever else later, maybe get a merge commit
| message with commented "conflicted" files. Well, it can get
| confusing.
| kragen wrote:
| > _Maybe it was hard to use backtick in the 70's and 80's, but
| today[2] you could use backtick to start a string and a single
| quote to end it._
|
| That's how quoting works by default in m4 and TeX, both defined
| in the 70s. Unfortunately Unicode retconned the ASCII
| apostrophe character ' to be a vertical line, maybe out of a
| misguided deference to Microsoft Windows, and now we all have
| to suffer the consequences. (Unless we're using Computer Modern
| fonts or other fonts that predate this error, such as VGA font
| ROM dumps.)
|
| In the 70s and 80s, and into the current millennium on Unix,
| `x' did look like ‘x’, but now instead it looks like dogshit.
| Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
| Unfortunately, following the Unicode botch theme, the standard
| failed to define an endianness or minimum width for them, so
| they're not very useful today. You can use them as enum values
| if you want to make your memory dumps easier to read in the
| debugger, and that's about it. I think Microsoft's compiler
| botched them so badly that even that's not an option if you
| need your code to run on it.
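|
| As a rough illustration of the endianness problem, the Python
| sketch below packs a four-character code the way GCC documents
| it (one character at a time, shifting the previous value left by
| eight bits). The helper name multichar_value is hypothetical,
| and other compilers are free to use a different scheme, which is
| exactly the portability gap described above.
|
```python
def multichar_value(chars: str) -> int:
    """Base-256, left-to-right value of a literal such as 'TEXT'."""
    value = 0
    for ch in chars:
        value = (value << 8) | ord(ch)
    return value


code = multichar_value("TEXT")
print(hex(code))                  # 0x54455854
print(code.to_bytes(4, "big"))    # b'TEXT'
print(code.to_bytes(4, "little")) # b'TXET'
```
|
| The same integer reads as TEXT in a big-endian memory dump and
| as TXET in a little-endian one.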
| ygra wrote:
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line
|
| Unicode does not prescribe the appearance of characters.
| Although in the code chart [1] it says >>neutral (vertical)
| glyph with mixed usage<< (next to >>apostrophe-quote<< and
| >>single quote<<), font vendors have to deal with this mixed
| usage. And with Unicode the correct quotation marks have
| their own code points, making it unnecessary to design fonts
| where the ASCII apostrophe takes their form, but rendering
| all other uses pretty ugly.
|
| I would regard using ` and ' as paired quotation marks as a
| hack from times when typographic expression was simply not
| possible with the character sets of the day.
|
| _________
|
| [1] 0027 ' APOSTROPHE
|     = apostrophe-quote (1.0)
|     = single quote
|     = APL quote
|     * neutral (vertical) glyph with mixed usage
|     * 2019 ' is preferred for apostrophe
|     * preferred characters in English for paired quotation
|       marks are 2018 ' & 2019 '
|     * 05F3 ' is preferred for geresh when writing Hebrew
|     - 02B9 ' modifier letter prime
|     - 02BC ' modifier letter apostrophe
|     - 02C8 ' modifier letter vertical line
|     - 0301 $ combining acute accent
|     - 030D $ combining vertical line above
|     - 05F3 ' hebrew punctuation geresh
|     - 2018 ' left single quotation mark
|     - 2019 ' right single quotation mark
|     - 2032 ' prime
|     - A78C latin small letter saltillo<<
| keybored wrote:
| > That's how quoting works by default in m4 and TeX, both
| defined in the 70s.
|
| Good point. And it was in m4[1] I saw that
| backtick+apostrophe syntax. I would have probably not thought
| of that possibility if I hadn't seen it there.
|
| [1] Probably on Wikipedia since I have never used it
|
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line, maybe out of a misguided
| deference to Microsoft Windows, and now we all have to suffer
| the consequences. (Unless we're using Computer Modern fonts
| or other fonts that predate this error, such as VGA font ROM
| dumps.)
|
| I do think the vertical line looks subpar (and I don't use it
| in prose). But most programmers don't seem bothered by it. :|
|
| > In the 70s and 80s, and into the current millennium on
| Unix, `x' did look like ‘x’, but now instead it looks like
| dogshit.
|
| Emacs tries to render it like ‘x’ since it uses
| backtick+apostrophe for quotes. With some mixed results in my
| experience.
|
| > Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| Aha, I honestly didn't even think that far. Seems a bit
| restrictive to not be able to use possessives and
| contractions in strings without escapes.
|
| > As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
|
| I should have made it clear that I was only considering
| C-likes and not C itself. A language from the C trigraph days
| can be excused. To a certain extent.
| kragen wrote:
| I'd forgotten about `' in Emacs documentation! That may be
| influenced by TeX.
|
| C multicharacter literals are unrelated to trigraphs.
| Trigraphs were a mistake added many years later in the ANSI
| process.
| tom_ wrote:
| See also: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
| kragen wrote:
| This is an excellent document. I disagree with its
| normative conclusions, because I think being incompatible
| with ASCII, Unix, Emacs, and TeX is worse than being
| incompatible with ISO-8859-1, Microsoft Windows, and MacOS
| 9, but it is an excellent reference for the factual
| background.
| shawa_a_a wrote:
| The comment character is also configurable:
| git config core.commentchar <char>
|
| This is helpful where you want to use, say, Markdown to have
| tidily formatted commit messages make up your pull request body
| too.
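|
| To see why the comment character matters, here is a simplified
| Python model of the "strip" style of commit-message cleanup. It
| is a sketch under assumptions, not git's implementation: the
| function name cleanup_strip is made up, and only the parts
| relevant to this thread are modelled (dropping comment lines,
| trimming trailing whitespace, collapsing blank lines).
|
```python
def cleanup_strip(message: str, comment_char: str = "#") -> str:
    """Drop comment lines and tidy blank lines, roughly like strip cleanup."""
    kept = []
    for line in message.splitlines():
        line = line.rstrip()
        if line.startswith(comment_char):
            continue  # treated as commentary, not as part of the message
        if line == "" and (not kept or kept[-1] == ""):
            continue  # skip leading and repeated blank lines
        kept.append(line)
    while kept and kept[-1] == "":
        kept.pop()  # skip trailing blank lines
    return "\n".join(kept)


msg = "#123 fix parser\n\nDetails.\n"
print(repr(cleanup_strip(msg)))       # 'Details.'
print(repr(cleanup_strip(msg, ";")))  # '#123 fix parser\n\nDetails.'
```
|
| With the default character the "#123" subject line silently
| disappears; with a different comment character it survives.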
| keybored wrote:
| I want to try to set it to `auto` and see what spicy things
| it comes up with.
| yen223 wrote:
| select'select'select
|
| is a perfectly valid SQL query, at least for Postgres.
|
| Languages' approach to whitespace between tokens is all over the
| place
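|
| A tiny Python sketch shows why no whitespace is needed here: the
| quote characters already end one token and start the next. The
| regular expression is a toy that only knows single-quoted
| strings and plain identifiers; it is not PostgreSQL's lexer.
|
```python
import re

# Either a single-quoted string ('' is an escaped quote) or an identifier.
TOKEN = re.compile(r"'(?:[^']|'')*'|[A-Za-z_][A-Za-z_0-9]*")

print(TOKEN.findall("select'select'select"))
# ['select', "'select'", 'select']
```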
| notsylver wrote:
| As soon as I saw this was part of llamafile I was hoping that it
| would be used to limit LLM output to always be "valid" code as
| soon as it saw the backticks, but I suppose most LLMs don't have
| problems with that anyway. And I'm not sure you'd want something
| like that automatically forcing valid code anyway
| dilap wrote:
| llama.cpp does support something like this -- you can give it a
| grammar which restricts the set of available next tokens that
| are sampled over
|
| so in theory you could notice "```python" or whatever and then
| start restricting to valid python code. (at least in theory,
| not sure how feasible/possible it would be in practice w/ their
| grammar format.)
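|
| The general idea of grammar-restricted sampling can be sketched
| in a few lines of Python. Everything here is a toy assumption:
| the "grammar" is a regular expression for prefixes of a list
| like [1,2,3], the scores are invented, and none of this uses
| llama.cpp's API or its grammar format.
|
```python
import re

# Toy grammar: a bracketed list of single digits such as [1,2,3].
# This pattern accepts every *prefix* of such a list.
VALID_PREFIX = re.compile(r"(\[(\d(,\d)*(,|\])?)?)?")


def is_valid_prefix(text: str) -> bool:
    return VALID_PREFIX.fullmatch(text) is not None


def pick_token(scores: dict, prefix: str, constrained: bool = True) -> str:
    """Greedy choice of the best-scoring token, optionally grammar-masked."""
    candidates = scores
    if constrained:
        # Keep only tokens that leave the output a valid prefix.
        candidates = {t: s for t, s in scores.items() if is_valid_prefix(prefix + t)}
    if not candidates:
        raise ValueError("no token keeps the output inside the grammar")
    return max(candidates, key=candidates.get)


# Made-up scores standing in for a model that wants to emit "2" next.
scores = {"2": 0.9, ",": 0.5, "]": 0.4}
print(pick_token(scores, "[1", constrained=False))  # 2
print(pick_token(scores, "[1"))                     # ,
```
|
| The unconstrained pick would produce "[12", which the toy
| grammar rejects, so the mask falls back to the next best token.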
|
| for code i'm not sure how useful it would be since likely any
| model that is giving you working code wouldn't be struggling w/
| syntax errors anyway?
|
| but i have had success experimentally using the feature to
| drive fiction content for a game from a smaller llm to be in a
| very specific format.
| notsylver wrote:
| yeah, ive used llama.cpp grammars before, which is why i was
| thinking about it. i just think it'd be cool for llamafile to
| do basically that, but with included defaults so you could
| eg, require JSON output. it could be cool for prototyping or
| something. but i dont think that would be too useful anyway,
| most of the time i think you would want to restrict it to a
| specific schema, so i can only see it being useful for
| something like a tiny local LLM for code completion, but that
| would just encourage valid-looking but incorrect code.
|
| i think i just like the idea of restricting LLM output, it
| has a lot of interesting use cases
| dilap wrote:
| gotchya. i do think that is a cool idea actually -- LLMs
| tiny enough to do useful things with formally structured
| output but not big enough to nail the structure ~100% is
| probably not an empty set.
| pwdisswordfishz wrote:
| > Of all the languages, I've saved the best for last, which is
| Ruby. Now here's a language whose syntax evades all attempts at
| understanding.
|
| TeX with its arbitrarily reprogrammable lexer: how adorable
| fanf2 wrote:
| Lisp reader macros allow you to program its lexer too.
| skydhash wrote:
| You can basically define a new language with a few lines of
| code in Racket.
| pansa2 wrote:
| > _TypeScript, Swift, Kotlin, and Scala take string interpolation
| to the furthest extreme of encouraging actual code being embedded
| inside strings. So to highlight a string, one must count curly
| brackets and maintain a stack of parser states._
|
| Presumably this is also true in Python - IIRC the brace-delimited
| fields within f-strings may contain arbitrary expressions.
|
| More generally, this must mean that the lexical grammar of those
| languages isn't regular. "Maintaining a stack" isn't part of a
| finite-state machine for a regular grammar - instead we're in the
| realm of pushdown automata and context-free grammars.
|
| Is it even possible to support generalized string interpolation
| within a strictly regular lexical grammar?
| aphantastic wrote:
| > Is it even possible to support generalized string
| interpolation within a strictly regular lexical grammar?
|
| Almost certainly not, a fun exercise is to attempt to devise a
| Pumping tactic for your proposed language. If it doesn't exist,
| it's not regular.
|
| https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...
| fanf2 wrote:
| Complicated interpolation can be lexed as a regular language if
| you treat strings as three separate lexical things, eg in
| JavaScript template literals there are,
| `stuff${ }stuff${ }stuff`
|
| so the ${ and } are extra closing and opening string
| delimiters, leaving the nesting to be handled by the parser.
|
| You need a lexer hack so that the lexer does not treat } as the
| start of a string literal, except when the parser is inside an
| interpolation but all nested {} have been closed.
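|
| Here is a rough Python sketch of that scheme for JavaScript-like
| template literals. It is a simplification with invented names
| (lex_template, the STRING and CODE tags): ordinary quoted
| strings and comments inside the interpolated code are ignored.
| The depth stack plays the role of the lexer hack, deciding
| whether a } is plain punctuation or the start of the next string
| piece.
|
```python
def lex_template(src: str):
    """Split source into STRING pieces (`..${, }..${, }..`) and CODE tokens."""
    tokens = []
    depth_stack = []  # one open-brace counter per active ${ ... }
    n = len(src)

    def read_chunk(start: int) -> int:
        """Scan string text after a backtick or a closing brace at `start`."""
        j = start + 1
        while j < n:
            if src[j] == "\\":
                j += 2  # skip an escaped character
            elif src.startswith("${", j):
                depth_stack.append(0)  # entering an interpolation
                return j + 2
            elif src[j] == "`":
                return j + 1
            else:
                j += 1
        raise SyntaxError("unterminated template literal")

    i = 0
    while i < n:
        ch = src[i]
        if ch == "`" or (ch == "}" and depth_stack and depth_stack[-1] == 0):
            if ch == "}":
                depth_stack.pop()  # this brace closes the interpolation
            end = read_chunk(i)
            tokens.append(("STRING", src[i:end]))
            i = end
        elif ch.isspace():
            i += 1
        else:
            if depth_stack and ch in "{}":
                depth_stack[-1] += 1 if ch == "{" else -1
            j = i + 1
            if ch.isalnum() or ch == "_":
                while j < n and (src[j].isalnum() or src[j] == "_"):
                    j += 1
            tokens.append(("CODE", src[i:j]))
            i = j
    if depth_stack:
        raise SyntaxError("unterminated interpolation")
    return tokens


for token in lex_template("`a${ x + `b${y}c` }d`"):
    print(token)
```
|
| Nesting falls out of the stack: the inner template pushes and
| pops its own counter before the outer } is seen.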
| irdc wrote:
| I'd be interested to see a re-usable implementation of joe's[0]
| syntax highlighting.[1] The format is powerful enough to allow
| for the proper highlighting of Python f-strings.[2]
|
| 0. https://joe-editor.sf.net/
|
| 1. https://github.com/cmur2/joe-
| syntax/blob/joe-4.4/misc/HowItW...
|
| 2.
| https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...
| pama wrote:
| Interestingly, python f-strings changed their syntax at version
| 3.12, so highlighting should depend on the version.
| irdc wrote:
| It's just that nesting them arbitrarily is now allowed,
| right? That shouldn't matter much for a mere syntax
| highlighter then. And one could even argue that code that
| relies on this too much is not really for human consumption.
| pansa2 wrote:
| Also, you can now use the same quote character that
| encloses an f-string within the {} expressions. That could
| make them harder to tokenize, because it makes it harder to
| recognise the end of the string.
| rererereferred wrote:
| In the C# multiquoted strings, how does it know this:
|     Console.WriteLine("""""");
|     Console.WriteLine("""""");
|
| Are 2 triplequoted empty strings and not one
| "\nConsole.WriteLine(" sixtuplequoted string?
| ygra wrote:
| The former, I'd say.
|
| https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
|
| For a multi-line string the quotes have to be on their own
| line.
| Joker_vD wrote:
| If the opening quotes are followed by anything that is not a
| whitespace before the next new-line (or EOF), then it's a
| single-line string.
|
| I imagine implementing those things took several iterations :)
| yen223 wrote:
| It's a syntax error!
|     Unterminated raw string literal.
|
| https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...
| Joker_vD wrote:
| Ah, so there is no backtracking in lexer for this case. Makes
| sense.
| ygra wrote:
| As for C#'s triple-quoted strings, they actually appeared in Java
| first, and C# ended up adopting the same or almost the same
| semantics, including stripping leading whitespace.
| pdw wrote:
| Some random things that the author seems to have missed:
|
| > but TypeScript, Swift, Kotlin, and Scala take string
| interpolation to the furthest extreme of encouraging actual code
| being embedded inside strings
|
| Many more languages support that:
|     C#          $"{x} plus {y} equals {x + y}"
|     Python      f"{x} plus {y} equals {x + y}"
|     JavaScript  `${x} plus ${y} equals ${x + y}`
|     Ruby        "#{x} plus #{y} equals #{x + y}"
|     Shell       "$x plus $y equals $(echo "$x+$y" | bc)"
|     Make :)     echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
|
| > Tcl
|
| Tcl is funny because comments are only recognized in code, and
| since it's homoiconic, it's very hard to distinguish code and
| data. { } are just funny string delimiters. E.g.:
| xyzzy {#hello world}
|
| Is xyzzy a command that takes a code block or a string? There's
| no way to tell. (Yes, that means that the Tcl tokenizer/parser
| cannot discard comments: only at evaluation time it's possible to
| tell if something is a comment or not.)
|
| > SQL
|
| PostgreSQL has the very convenient dollar-quoted strings:
| https://www.postgresql.org/docs/current/sql-syntax-lexical.h...
| E.g. these are equivalent:
|     'Dianne''s horse'
|     $$Dianne's horse$$
|     $SomeTag$Dianne's horse$SomeTag$
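|
| For a highlighter, the interesting part is that the closing
| delimiter has to repeat whatever tag opened the string. A
| minimal Python sketch, assuming ASCII-only tags and using a
| made-up helper name (read_dollar_quoted), can do that with a
| backreference:
|
```python
import re

# $tag$ ... $tag$, where the tag may be empty. The tag group always
# participates (it can match the empty string), so \1 also works for $$...$$.
DOLLAR_QUOTED = re.compile(
    r"\$((?:[A-Za-z_][A-Za-z_0-9]*)?)\$(.*?)\$\1\$", re.DOTALL
)


def read_dollar_quoted(text: str, pos: int = 0):
    """Return (body, end_offset) of the dollar-quoted literal starting at pos."""
    m = DOLLAR_QUOTED.match(text, pos)
    if m is None:
        raise ValueError("no dollar-quoted string at this offset")
    return m.group(2), m.end()


print(read_dollar_quoted("$$Dianne's horse$$")[0])
print(read_dollar_quoted("$SomeTag$Dianne's horse$SomeTag$")[0])
print(read_dollar_quoted("$fn$SELECT $$x$$;$fn$")[0])
```
|
| The last call shows why tags exist: an inner $$...$$ can sit
| inside a string delimited by $fn$ without ending it.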
| autarch wrote:
| Perl lets you do this too:
|     my $foo = 5;
|     my $bar = 'x';
|     my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
|     print "$quux\n";
|
| This prints out:
|     I have 5 x's: xxxxx
|
| The "@{[...]}" syntax is abusing Perl's ability to interpolate
| an _array_ as well as a scalar. The inner "[...]" creates an
| array reference and the outer "@{...}" dereferences it.
|
| For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates the
| array reference.
| Izkata wrote:
| > For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates
| the array reference.
|
| ...because it's an array value? Aside from how the languages
| handle references, how is that part any different from, for
| example, this in python:
|     >>> [5 * 'x']
|     ['xxxxx']
|
| You can put (almost) anything there, as long as it's an
| expression that evaluates to a value. The resulting value is
| what goes into the array.
| autarch wrote:
| I understand that's constructing an array. What's a bit odd
| is that the interpreter allows you to string interpolate
| any expression when constructing the array reference inside
| the string.
| Izkata wrote:
| It's not...? Well, not directly: It's string
| interpolating an array of values, and the array is
| constructed using values from the results of expressions.
| These are separate features that compose nicely.
| weinzierl wrote:
| You also don't need quotes around strings (barewords). So
| my $bar = x;
|
| should give the same result.
|
| Good luck with lexing that properly.
|
| https://perlmaven.com/barewords-in-perl
| layer8 wrote:
| > actual code being embedded inside strings
|
| My view on this is that it shouldn't be interpreted as code
| being embedded inside strings, but as a special form of string
| concatenation syntax. In turn, this would mean that you can
| nest the syntax, for example:
|     "foo { toUpper("bar { x + y } bar") } foo"
|
| The individual tokens being (one per line):
|     "foo {
|     toUpper
|     (
|     "bar {
|     x
|     +
|     y
|     } bar"
|     )
|     } foo"
|
| If `+` does string concatenation, the above would effectively
| be equivalent to:
|     "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
|
| I don't know if there is a language that actually works that
| way.
| panzi wrote:
| Indeed in some of the listed languages you can nest it like
| that, but in others (e.g. Python) you can't. I would guess
| they deliberately don't want to enable that and it's not a
| problem in their parser or something.
| layer8 wrote:
| Even when nesting is disallowed, my point is that I find it
| preferable to not view it (and syntax-highlight it) as a
| "special string" with embedded magic, but as multiple
| string literals with just different delimiters that allow
| omitting the explicit concatenation operator, and normal
| expressions interspersed in between. I think it's important
| to realize that it is really just very simple syntactic
| sugar for normal string concatenation.
| Tarean wrote:
| As of python 3.6 you can nest fstrings. Not all formatters
| and highlighters have caught up, though.
|
| Which is fun, because correct highlighting depends on
| language version. Haskell has similar problems where
| different compiler flags require different parsers. Close
| enough is sufficient for syntax highlighting, though.
|
| Python is also a bit weird because it calls the format
| methods, so objects can intercept and react to the format
| specifiers in the f-string while being formatted.
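|
| A short Python example of that hook: whatever follows the colon
| inside the braces is handed to the object's __format__ method as
| a plain string. The Celsius class and its "F" specifier are
| invented for illustration.
|
```python
class Celsius:
    def __init__(self, degrees: float):
        self.degrees = degrees

    def __format__(self, spec: str) -> str:
        # `spec` is the text after the colon inside the braces.
        if spec == "F":
            return f"{self.degrees * 9 / 5 + 32:.1f} F"
        if spec in ("", "C"):
            return f"{self.degrees:.1f} C"
        raise ValueError(f"unknown format spec {spec!r}")


boiling = Celsius(100)
print(f"{boiling}")    # 100.0 C
print(f"{boiling:F}")  # 212.0 F
```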
| panzi wrote:
| I didn't mean nested f-strings. I mean this is a syntax
| error:
|     >>> print(f"foo {"bar"}")
|     SyntaxError: f-string: expecting '}'
|
| Only this works:
|     >>> print(f"foo {'bar'}")
|     foo bar
| pdw wrote:
| You're using an old Python version. On recent versions,
| it's perfectly fine:
|     Python 3.12.7 (main, Oct 3 2024, 15:15:22) [GCC 14.2.0] on linux
|     Type "help", "copyright", "credits" or "license" for more information.
|     >>> print(f"foo {"bar"}")
|     foo bar
| epcoa wrote:
| > "foo { ...
|
| That should probably not be one token.
|
| > My view on this is that it shouldn't be interpreted as code
| being embedded inside strings
|
| I'm not sure exactly what you're proposing and how it is
| different. You still can't parse it as a regular lexical
| grammar.
|
| How does this change how you highlight either?
|
| Whatever you call it, to the lexer it is a special string, it
| has to know how to match it, the delimiters are materially
| different than concatenation.
|
| I might be being dense but I'm not sure what's formally
| distinct.
| panzi wrote:
| Is this a bash-ism?
|     "$x plus $y equals $((x+y))"
| jonahx wrote:
| This works in "sh" as well for me.
| panzi wrote:
| On some systems (like on mine) sh is just a link to bash,
| so I couldn't test it.
| jwilk wrote:
| No, it's portable shell syntax.
| LukeShu wrote:
| "$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic
| Expansion").
|
| But if I'm not mistaken, it originated in csh.
| susam wrote:
| > Is this a bash-ism?
|
| > "$x plus $y equals $((x+y))"
|
| No, it is specified in POSIX: https://pubs.opengroup.org/onli
| nepubs/9699919799/utilities/V...
| therein wrote:
| > PostgreSQL has the very convenient dollar-quoted strings
|
| I did not know that. Today I learned.
| sundarurfriend wrote:
| > Many more languages support that:
|
| Julia as well:
|     Julia       "$x plus $y equals $(x+y)"
| thesz wrote:
| VHDL
|
| There is a record constructor syntax in VHDL using attribute
| invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr).
| This means that if the first field of your record is a subtype
| of a character type, you can get a record construction
| expression like this one: REC'('0',1,"10101").
|
| Good luck distinguishing between '(' as a character literal and
| "'", "(" and "'0'" at lexical level.
|
| Haskell.
|
| Haskell has context-free syntax for bracketed ("{-" ... "-}")
| comments. Lexer has to keep bracketed comment syntax balanced
| (for every "{-" there should be accompanying "-}" somewhere).
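|
| The balancing requirement is easy to sketch with a depth counter
| in Python. This is a toy under stated assumptions: the function
| name strip_block_comments is made up, and string literals, line
| comments and pragmas are ignored. The counter is what takes the
| job beyond a single regular expression.
|
```python
def strip_block_comments(src: str) -> str:
    """Remove nested {- ... -} comments, tracking nesting depth."""
    out = []
    depth = 0
    i = 0
    while i < len(src):
        if src.startswith("{-", i):
            depth += 1  # every opener nests one level deeper
            i += 2
        elif depth and src.startswith("-}", i):
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(src[i])  # keep text outside any comment
            i += 1
    if depth:
        raise SyntaxError("unterminated block comment")
    return "".join(out)


print(repr(strip_block_comments("a {- b {- c -} d -} e")))  # 'a  e'
```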
| 1vuio0pswjnm7 wrote:
| Shell "$x plus $y equals $((x+y))"
|
| Shell "$x plus $y equals $((expr $x + $y))"
| __MatrixMan__ wrote:
| This was a fun read, but it left me a bit more sympathetic to the
| lisp perspective, which (if I've understood it) is that syntax,
| being not an especially important part of a language, is more of
| a hurdle than a help, and should be as simple and uniform as
| possible so we can focus on other things.
|
| Which is sort of ironic because learning how to do structural
| editing on lisps has absolutely been more hurdle than help so
| far, but I'm sure it'll pay off eventually.
| mqus wrote:
| Having a simple syntax might be fine for computers but syntax
| is mainly designed to be read and written by humans. Having a
| simple one like lisp then just makes syntactic discussions a
| semantic problem, just shifting the layers.
|
| And I think a complex syntax is far easier to read and write
| than a simple syntax with complex semantics. You also get a
| faster feedback loop in case the syntax of your code is wrong
| vs the semantics (which might be undiscovered until runtime).
| drewr wrote:
| I don't understand your distinction between syntax and
| semantics. If the semantics are complex, wouldn't that mean
| the syntax is thus complex?
| SuperCuber wrote:
| lisp's syntax is simple - it's just parentheses to define a
| list, first element of a list is executed as a function.
|
| but for example a language like C has many different
| syntaxes for different operations, like function
| declaration or variable or array syntax, or if/switch-case
| etc etc.
|
| so to know C syntax you need to learn all these different
| ways to do different things, but in lisp you just need to
| know how to match parentheses.
|
| But of course you still want to declare variables, or have
| if/else and switch case. So you instead need to learn the
| builtin macros (what GP means by semantics) and their
| "syntax" that is technically not part of the language's
| syntax but actually is since you still need all those
| operations enough that they are included in the standard
| library and defining your own is frowned upon.
| skydhash wrote:
| Most languages' abstract machines expose a very simple API,
| it's up to the language to add useful constructs to help us
| write code more efficiently. Languages like Lisp start with
| a very simple syntax, then add those constructs with the
| language itself (even though those can be fixed using a
| standard), others just add it through the syntax. These
| constructs plus the abstract machine's operations form the
| semantics, syntax is however the language designer decided
| to present them.
| __MatrixMan__ wrote:
| Jury's out re: whether I feel this in my gut. Need more time
| with the lisps for that. But re: cognitive load maybe it goes
| like:
|
| 1. 1 language to rule them all, fancy syntax
|
| 2. Many languages, 1 simple syntax to rule them all
|
| 3. Many languages and many fancy syntaxes
|
| Here in the wreckage of the tower of babel, 1. isn't really
| on the table. But 2. might have benefits because the
| inhumanity of the syntax need only be confronted once. The
| cumulative cost of all the competing opinionated fancy
| syntaxes may be the worst option. Think of all the hours lost
| to tabs vs spaces or braces vs whitespace.
| dartos wrote:
| I think 3 is not only a natural state, but the best state.
|
| I don't think we can have 1 language that satisfies the
| needs of all people who write code, and thus, we can't have
| 1 syntax that does that either.
|
| 3 seems the only sensible solution to me, and we have it.
| __MatrixMan__ wrote:
| I dunno, here in 3 the hardest part of learning a
| language has little to do with the language itself and
| more to do with the ecosystem of tooling around that
| language. I think we could more easily get on to the
| business of using the right language for the job if more
| of that tooling was shared; if each language, for
| instance, did not have its own package manager, its own
| IDE, its own linters and language servers all with their
| own idiosyncrasies arising not from deep philosophical
| differences of the associated language but instead from
| accidental quirks of perspective from whoever decided
| that their favorite language needed a new widget.
|
| I admire the widget makers, especially those wrangling
| the gaps between languages. I just wish their work could
| be made easier.
| skydhash wrote:
| I really like the Linux package managers. If you're going
| to write an application that will run on some system,
| it's better to bake dependencies into it. And with
| virtualization and containerization, the system is not
| tied to a physical machine. I've been using containers
| (incus) more and more for real development purposes as I
| can use almost the same environment to deploy. I don't
| care much about the IDE, but I'm glad we have LSP, Tree-
| sitter, and DAP. The one thing I do not like is the
| proliferation of tooling version managers (NVM, ...) instead
| of managing the environment itself (tied to the project).
| nlitened wrote:
| I am surprised to hear that structural editing has been a
| hurdle for you, and I think I can offer a piece of advice. I
| also used to be terrified by its apparent complexity, but later
| found out that one just needs to use parinfer and to know key
| bindings for only three commands: slurp, barf, and raise.
|
| With just these four things you will be 95% there, enjoying the
| fruits of paredit without any complexity -- all the remaining
| tricks you can learn later when you feel like you're fluent.
| __MatrixMan__ wrote:
| Thanks very much for the advice, it's timely.
|
| <rant> It's not so much the editing itself but the
| unfamiliarity of the ecosystem. It seems it's a square peg,
| and I've been crafting a round hole of habits for it:
|
| I guess I should use emacs? How to even configure it such
| that these actions are available? Or maybe I should write a
| plugin for helix so that I can be in a familiar environment.
| Oh, but the helix plugin language is a scheme, so I guess
| I'll use emacs until I can learn scheme better and then write
| that plugin. Oh but emacs keybinds are conflicting with what
| I've configured for zellij, maybe I can avoid conflicts by
| using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex
| seems like it aligns with my modal brain, oh but there goes
| another afternoon of fussing with emacs. Found and reported a
| symex "bug" but apparently it only appears in nix-governed
| environments so I guess I gotta figure out how to report the
| packaging bug (still todo). Also, I guess I might as well
| figure out how to get emacs to evaluate expressions based on
| which ones are selected, since that's one of the fun things
| you can do in lisps, but there's no plugin for the scheme
| that helix is using for its plugin language (which is why I'm
| learning scheme in the first place), but it turns out that AI
| is weirdly good at configuring emacs so now my emacs config
| contains most of what that plugin would entail. Ok, now I'm
| finally ready to learn scheme, I've got this big list of new
| actions to learn: https://countvajhula.com/2021/09/25/the-
| animated-guide-to-sy.... Slurp, barf, and raise you say?
| excellent, I'll focus on those.
|
| I'm not actually trying to critique the unfamiliar space.
| These are all self inflicted wounds: me being persnickety
| about having it my way. It's just usually not so difficult to
| use something new and also have it my way.</rant>
| xenophonf wrote:
| I never bothered with structural editing on Emacs. I just
| use the sentence/paragraph movement commands. M-a, M-e,
| M-n, M-p, M-T, M-space, etc.
| nlitened wrote:
| To be fair, I am not a "lisper" and I don't know Emacs at
| all. I am just a Clojure enjoyer who uses IntelliJ +
| Cursive with its built-in parinfer/paredit.
| pxc wrote:
| > Oh but emacs keybinds are conflicting with what I've
| configured for zellij,
|
| Don't do that. ;)
|
| Emacs is a graphical application! Don't use it in the
| terminal unless you really have to (i.e., you're using it
| on a remote machine and TRAMP will not do).
|
| > it turns out that AI is weirdly good at configuring emacs
|
| I was just chatting with a friend about this. ChatGPT seems
| to be much better at writing ELisp than many other
| languages I've asked it to work with.
|
| Also while you're playing with it, you might be interested
| in checking out kakoune.el or meow, which provide modal
| editing in Emacs but with the selection-first ordering for
| commands, like in Kakoune and Helix rather than the old vi
| way.
|
| PS: symex looks really interesting! Hadn't seen that one.
| fanf2 wrote:
| Lisp has reader macros which allow you to reprogram its lexer.
| Lisp macros allow you to program the translation from the
| visible structure to the parse tree.
|
| For example, https://pyret.org/
|
| It really isn't simple or necessarily uniform.
| __MatrixMan__ wrote:
| I've heard that certain lisps (Common Lisp comes up when I
| search for reader macros) allow for all kinds of tinkering
| with themselves. But the ability of one to make itself not a
| lisp anymore, while interesting, doesn't seem to say much
| about the merits of sticking to s-expressions, except maybe
| to point out that somebody once decided not to.
| kazinator wrote:
| I don't think it's easy to write a good syntax coloring engine
| like the one in Vim.
|
| Syntax coloring has to handle context: different rules for
| material nested in certain ways.
|
| Vim's syntax highlighter lets you declare two kinds of items:
| matches and regions. Matches are simpler lexical rules, whereas
| regions have separate expressions for matching the start and end
| and middle. There are ways to exclude leading and trailing
| material from a region.
|
| Matches and regions can declare that they are contained. In that
| case they are not active unless they occur in a containing
| region.
|
| Contained matches declare which regions contain them.
|
| Regions declare which other regions they contain.
|
| That's the basic semantic architecture; there are bells and
| whistles in the system due to situations that arise.
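|
| As a rough illustration of containment (a Python sketch, not Vim's
| actual engine; the function name and the TODO rule are made up for
| the example): a comment region is matched first, and the contained
| TODO match is only tried inside it, so a TODO outside any comment
| gets no span.

```python
import re

# Hypothetical rules: one region (C-style block comment) and one
# contained match (the word TODO), loosely mirroring the idea of a
# region that lists the contained items it allows.
REGION = re.compile(r"/\*.*?\*/", re.DOTALL)
CONTAINED = re.compile(r"\bTODO\b")


def find_spans(src):
    """Return (kind, start, end) spans; TODO only counts inside a comment."""
    spans = []
    for region in REGION.finditer(src):
        spans.append(("comment", region.start(), region.end()))
        # The contained rule is active only within the region's text.
        for m in CONTAINED.finditer(region.group()):
            offset = region.start()
            spans.append(("todo", offset + m.start(), offset + m.end()))
    return spans


print(find_spans("TODO /* TODO fix */"))
```

| A real engine would also need nested regions and rules for
| excluding the start and end delimiters, which this sketch leaves out.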
|
| I don't think even Justine could develop that in an interview,
| other than as an overnight take home.
| kazinator wrote:
| Here is an example of something hard to handle: TXR language
| with embedded TXR Lisp.
|
| This is the "genman" script which takes the raw output of a
| manpage to HTML converter, and massages it to form the HTML
| version of the TXR manual:
|
| https://www.kylheku.com/cgit/txr/tree/genman.txr
|
| Everything that is white (not colored) is literal template
| material. Lisp code is embedded in directives, like @(do ...).
| In this scheme, TXR keywords appear purple, TXR Lisp ones
| green. They can be the same; see the (and ...) in line 149,
| versus numerous occurrences of @(and).
|
| Quasistrings contain nested syntax: see 130 where `<a href ..>
| ... </a>` contains an embedded (if ...). That could itself
| contain a quasistring with more embedded code.
|
| TXR's txr.vim and tl.vim syntax definition files are both
| generated by this:
|
| https://www.kylheku.com/cgit/txr/tree/genvim.txr
| saghm wrote:
| Naively, I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| iterate over the tokens and update the color of a token based
| on the type of node (and maybe just tracking a diff to avoid
| needing to recolor things that haven't changed). I'm guessing
| that if this isn't done, it's for efficiency reasons (e.g. due
| to requiring parsing the whole file to highlight rather than
| just the part currently visible on the screen)?
| Someone wrote:
| > I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| [...] I'm guessing that if this isn't done, it's for
| efficiency reasons
|
| It's not only running time, but also ease of implementation.
|
| A good syntax highlighter should do a decent job highlighting
| both valid and invalid programs (rationale: in most (editor,
| language) pairs, writing a program involves going through
| moments where the program being written isn't a valid
| program)
|
| If you decide to use an AST, that means you need to have good
| heuristics for turning invalid programs into valid ones that
| best mimic what the programmer intended. That can be
| difficult to achieve (good compilers have such heuristics,
| but even if you have such a compiler, chances are it isn't
| possible to reuse them for syntax coloring)
|
| If this simpler approach gives you most of what you can get
| with the AST approach, why bother writing that?
|
| Also, there are languages where some programs can't be
| perfectly parsed or syntax colored without running them. For
| those, you need this approach.
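|
| To make the "simpler approach" concrete, here is a minimal Python
| sketch of a purely lexical classifier (the token classes, keyword
| set and names are assumptions for illustration, not taken from any
| real highlighter). It never rejects input: an unterminated string
| simply runs to the end of the line, and anything unrecognised
| falls through as "other".

```python
import re

# Alternatives are tried in order; the last one matches any single
# character, so every input is fully covered by tokens.
TOKEN = re.compile(r'''
    (?P<comment>//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\\n])*"?)
  | (?P<number>\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<other>.)
''', re.VERBOSE | re.DOTALL)

KEYWORDS = {"if", "else", "return", "while"}  # illustrative subset


def classify(src):
    """Return a list of (kind, text) pairs covering all of src."""
    out = []
    for m in TOKEN.finditer(src):
        kind = m.lastgroup
        if kind == "word":
            kind = "keyword" if m.group() in KEYWORDS else "identifier"
        out.append((kind, m.group()))
    return out


# A half-typed program still gets sensible colors.
print(classify('return "oops'))
```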
| susam wrote:
| > Every C programmers (sic) knows you can't embed a multi-line
| comment in a multi-line comment.
|
| And every Standard ML programmer might find this to be a
| surprising limitation. The following is a valid Standard ML
| program:
|     (* (* Nested (**) *) comment *)
|     val _ = print "hello, world\n"
|
| Here is the output:
|     $ sml < hello.sml
|     Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
|     - = hello, world
|     $ mlton hello.sml && ./hello
|     hello, world
|
| Given how C was considered one of the "expressive" languages when
| it arrived, it's curious that nested comments were never part of
| the language.
| dahart wrote:
| There are 3 things I find funny about that comment: ML didn't
| have single-line comments, so same level of surprising
| limitation. I've never heard someone refer to C as
| "expressive", but maybe it was in 1972 when compared to
| assembly. And what bearing does the comment syntax have on the
| expressiveness of a language? I would argue absolutely none at
| all, by _definition_. :P
| susam wrote:
| > ML didn't have single-line comments, so same level of
| surprising limitation.
|
| It is not quite clear to me why the lack of single-line
| comments is such a surprising limitation. After all, a
| single-line block comment can easily serve as a substitute.
| However, there is no straightforward workaround for the lack
| of nested block comments.
|
| > I've never heard someone refer to C as "expressive", but
| maybe it was in 1972 when compared to assembly.
|
| I was thinking of Fortran in this context. For instance,
| Fortran 77 lacked function pointers and offered a limited set
| of control flow structures, along with cumbersome support for
| recursion. I know Fortran, with its native support for
| multidimensional arrays, excelled in numerical and scientific
| computing but C quickly became the preferred language for
| general purpose computing.
|
| While very few today would consider C a pinnacle of
| expressiveness, when I was learning C, the landscape of
| mainstream programming languages was much more restricted. In
| fact, the preface to the first edition of K&R notes the
| following:
|
| _"In our experience, C has proven to be a pleasant,
| expressive and versatile language for a wide variety of
| programs."_
|
| C, Pascal, etc. stood out as some of the few mainstream
| programming languages that offered a reasonable level of
| expressiveness. Of course, Lisp was exceptionally expressive
| in its own right, but it wasn't always the best fit for
| certain applications or environments.
|
| > And what bearing does the comment syntax have on the
| expressiveness of a language?
|
| Nothing at all. I agree. The expressiveness of C comes from
| its grammar, which the language parser handles. Support for
| nested comments, in the context of C, is a concern for the
| lexer, so indeed one does not directly influence the other.
| However, it is still curious that a language with such a
| sophisticated grammar and parser could not allocate a bit of
| its complexity budget to support nested comments in its
| lexer. This is a trivial matter, I know, but I still couldn't
| help but wonder about it.
| dahart wrote:
| Fair enough. From my perspective, lack of single line
| comments is a little surprising because most other
| languages had it at the time (1973, when ML was
| introduced). Lack of nested comments doesn't seem
| surprising, because it isn't an important feature for a
| language, and because most other languages did not have it
| at the time (1972, when C was introduced).
|
| I can imagine both pro and con arguments for supporting
| nested comments, but regardless of what I think, C
| certainly could have added support for nested comments at
| any time, and hasn't, which suggests that there isn't
| sufficient need for it. That might be the entire
| explanation: not even worth a little complexity.
| masfuerte wrote:
| AFAIK, C didn't get single line comments until C99. They
| were a C++ feature originally.
| dahart wrote:
| Oh wow, I didn't remember that, and I did start writing C
| before 99. I stand corrected. I guess that is a little
| surprising. ;)
|
| Is it true that many languages had single line comments?
| Maybe I'm forgetting more, but I remember everything else
| having single line comments... asm, basic, shell. I used
| Pascal in the 80s and apparently forgot it didn't have
| line comments either?
| masfuerte wrote:
| That's my recollection, that most languages had single
| line comments. Some had multi-line comments but C++ is
| the first I remember having syntaxes for both. That said,
| I'm not terribly familiar with pre-80s stuff.
| susam wrote:
| > C certainly could have added support for nested
| comments at any time
|
| After C89 was ratified, adding nested comments to C would
| have risked breaking existing code. For instance, this is
| a valid program in C89:
|     #include <stdio.h>
|     int main() {
|         /* /* Comment */
|         printf("hello */ world");
|         return 0;
|     }
|
| However, if a later C standard were to introduce nested
| comments, it would break the above program because then
| the following part of the program would be recognised as
| a comment:
|     /* /* Comment */
|     printf("hello */
|
| The above text would be ignored. Then the compiler would
| encounter the following:
|     world");
|
| This would lead to errors like _undeclared identifier
| 'world'_, _missing terminating " character_, etc.
| pklausler wrote:
| > Fortran 77 lacked function pointers
|
| But we did have dummy procedures, which covered one of the
| important use cases directly, and which could be abused to
| fake function/subroutine pointers stored in data.
| gsliepen wrote:
| Well there is one way to nest comments in C, and that's by
| using #if 0:
|     #if 0
|     This is a
|     #if 0
|     nested comment!
|     #endif
|     #endif
| fanf2 wrote:
| Except that text inside #if 0 still has to lex correctly.
|
| (unifdef has some evil code to support using C-style
| preprocessor directives with non-C source, which mostly boils
| down to ignoring comments. I don't recommend it!)
| dahart wrote:
| > Except that text inside #if 0 still has to lex correctly.
|
| Are you sure? I just tried on godbolt and that's not true
| with gcc 14.2. I've definitely put syntax errors
| intentionally into #if 0 blocks and had it compile. Are you
| thinking of some older version or something? I thought the
| pre-processor ran before the lexer since always...
| fanf2 wrote:
| There are three (relevant) phases (see "translation
| phases" in section 5 of the standard):
|
| * program is lexed into preprocessing tokens; comments
| turn into whitespace
|
| * preprocessor does its thing
|
| * preprocessor tokens are turned into proper tokens;
| different kinds of number are disambiguated; keywords and
| identifiers are disambiguated
|
| If you put an unclosed comment inside #if 0 then it won't
| work as you might expect.
| dahart wrote:
| Ah, I see. You're right!
| kragen wrote:
| This is not just true of Standard ML; it's also true of regular
| ML.
| layer8 wrote:
| Lexing nested comments requires maintaining a stack (or at
| least a nesting-level counter). That wasn't traditionally seen
| as being within the realm of lexical analysis, which would only
| use a finite-state automaton, like regular expressions.
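|
| The nesting-level counter can be sketched in a few lines of Python
| (a simplified illustration for ML-style (* ... *) comments; the
| function name is made up, and string literals are deliberately
| ignored, which a real lexer could not do):

```python
def strip_nested_comments(src):
    """Remove (* ... *) comments, honoring nesting via a depth counter."""
    out = []
    depth = 0  # current comment nesting level; 0 means "in code"
    i = 0
    while i < len(src):
        pair = src[i:i + 2]
        if pair == "(*":
            depth += 1
            i += 2
        elif pair == "*)" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(src[i])  # keep code, drop comment text
            i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)


print(strip_nested_comments("(* (* Nested (**) *) comment *)\nval x = 1"))
```

| The counter is the only state beyond what a plain finite-state
| scanner keeps, but it is unbounded, which is what takes nested
| comments outside the regular languages.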
| lupire wrote:
| > You'll notice its hash function only needs to consider a single
| character in in a string. That's what makes it perfect,
|
| Is that a joke?
|
| https://en.m.wikipedia.org/wiki/Perfect_hash_function
| playingalong wrote:
| Nice read.
|
| I guess the article could be called Falsehoods Programmers Assume
| of Programming Language Syntaxes.
| TomatoCo wrote:
| I think my favorite C trigraph was something like
| do_action() ??!??! handle_error()
|
| It almost looks like special error handling syntax but still
| remains satisfying once you realize it's an || logical-or
| expression that uses short-circuiting rules to execute
| handle_error() only if the action returns zero (false).
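|
| For anyone who wants to see the substitution spelled out, here is a
| small Python sketch of trigraph replacement (the nine trigraphs are
| the standard C ones; the function name is made up, and this is
| only the replacement step, not a preprocessor):

```python
# The nine C trigraph sequences and their replacements.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\",
    "??)": "]", "??'": "^", "??<": "{",
    "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(src):
    """Scan left to right, replacing each trigraph with one character."""
    out = []
    i = 0
    while i < len(src):
        tri = src[i:i + 3]
        if tri in TRIGRAPHS:
            out.append(TRIGRAPHS[tri])
            i += 3
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


print(replace_trigraphs("do_action() ??!??! handle_error()"))
```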
| wslh wrote:
| Did you choose the legacy C trigraphs over || for aesthetic
| purposes?
| wslh wrote:
| Could you review my comment on HN? Please educate me if there
| is something I haven't understood, rather than downvoting my
| question.
| samatman wrote:
| The grandparent post is specifically about trigraphs.
| Saying something about trigraphs was the end-in-itself,
| trigraphs were chosen to illustrate something about
| trigraphs. So your question made no sense. Hope that helps.
| IshKebab wrote:
| I don't understand why you wouldn't use Tree Sitter's syntax
| highlighting for this. I mean it's not going to be as fast but
| that clearly isn't an issue here.
|
| Is this a "no third party dependencies" thing?
| jart wrote:
| I don't want to require that everyone who builds llamafile from
| source install Rust. I don't even require that people
| install the gperf command, since I can build gperf as a 700kb
| actually portable executable and vendor it in the repo. Tree
| sitter I'd imagine does a really great highly precise job with
| the languages it supports. However it appears to support fewer
| of them than I am currently. I'm taking a breadth first
| approach to syntax highlighting, due to the enormity of
| languages LLMs understand.
| IshKebab wrote:
| I think the Rust component of tree-sitter-highlight is
| actually pretty small (Tree Sitter generates C for the actual
| parser).
|
| But fair enough - fewer dependencies is always nice,
| especially in C++ (which doesn't have a modern package
| manager) and in ML where an enormous janky Python
| installation is apparently a perfectly normal thing to
| require.
| mdaniel wrote:
| I somehow thought Conan[1] was the C++ package manager;
| it's at least partially supported by GitLab, for what
| that's worth
|
| 1: https://docs.conan.io/2/introduction.html
| IshKebab wrote:
| No, if anything vcpkg is "the C++ package manager", but
| it's nowhere near pervasive and easy-to-use enough to
| come close to even Pip. It's leagues away from Cargo, Go,
| and other _actually good_ PL package managers.
| jim_lawless wrote:
| Forth has a default syntax, but Forth code can execute during the
| compilation process allowing it to accept/compile custom
| syntaxes.
| SonOfLilit wrote:
| Justine gets very close to the hairiest parsing issue in any
| language without encountering it:
|
| Perl's syntax is undecidable, because the difference between
| treating some characters as a comment or as a regex can depend on
| the type of a variable that is only determined e.g. based on
| whether a search for a Collatz counterexample terminates, or
| just, you know, user input.
|
| https://perlmonks.org/?node_id=663393
|
| C++ templates have a similar issue, I think.
| fanf2 wrote:
| I think possibly the most hilariously complicated instance of
| this is in perl's tokenizer, toke.c (which starts with a
| Tolkien quote, 'It all comes from here, the stench and the
| peril.' -- Frodo).
|
| There's a function called intuit_more which works out if
| $var[stuff] inside a regex is a variable interpolation followed
| by a character class, or an array element interpolation. Its
| result can depend on whether something in the stuff has been
| declared as a variable or not.
|
| But even if you ignore the undecidability, the rest is still
| ridiculously complicated.
|
| https://github.com/Perl/perl5/blob/blead/toke.c#L4502
| swolchok wrote:
| > C++ templates have a similar issue
|
| TIL! I went and dug up a citation:
| https://blog.reverberate.org/2013/08/parsing-c-is-literally-...
| layer8 wrote:
| How could a search for a Collatz counterexample possibly
| terminate? ;)
| petesergeant wrote:
| > Perl also has this goofy convention for writing man pages in
| your source code
|
| The world corpus of software would be much better documented if
| everywhere else had stolen this from Perl. Inline POD is great.
| kragen wrote:
| Perl and Python stole it from Emacs Lisp, though Perl took it
| further. I'm not sure where Java stole it from, but nowadays
| Doxygen is pretty common for C code. Unfortunately this results
| in people thinking that Javadoc and Doxygen are substitutes for
| actual documentation like the Emacs Lisp Reference Manual,
| which cannot be generated from docstrings, because the
| organization of the source code is hopelessly inadequate for a
| reference manual.
| mdaniel wrote:
| > Emacs Lisp Reference Manual, which cannot be generated from
| docstrings, because the organization of the source code is
| hopelessly inadequate for a reference manual.
|
| Well, they're not doing themselves any favors by just willy
| nilly mixing C with "user-facing" defuns
| <https://emba.gnu.org/emacs/emacs/-/blob/ed1d691184df4b50da6b...>.
| I was curious if they could benefit from "literate programming"
| since OrgMode is _the bee's knees_ but not with that style of
| coding they can't
| metadat wrote:
| _> The languages I decided to support are Ada, Assembly, BASIC,
| C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML,
| Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make,
| Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust,
| Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig._
|
| A few (admittedly silly) questions about the list:
|
| 1. Why no Erlang, Elixir, or Crystal?
|
| Erlang appears to be just at the author's boundary at #47 on the
| TIOBE index. https://www.tiobe.com/tiobe-index/
|
| 2. What is _"Shell"_? Sh, Bash, Zsh, Windows Cmd, PowerShell..?
|
| 3. Perl but no Awk? Curious why, because Awk is a similar but
| comparatively trivial language. Widely used, too.
|
| To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet
| m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.
|
| What's the methodology / criteria? Only things the author is
| already familiar with?
|
| Still a fun article.
| Yasuraka wrote:
| TIOBE's index is quite literally worthless, especially with
| regards to its stated purpose, let alone as a general point of
| orientation.
|
| I'd wish that people would stop lending it any credibility.
| dakiol wrote:
| Wouldn't it be possible to let the LLM do the highlighting? Instead
| of returning code in plain text, it could return code within html
| with the appropriate tags. Maybe it's harder than it sounds...
| but if it's just for highlighting the code the LLM returns, I
| wouldn't mind the highlighting not being 100% accurate.
| trashburger wrote:
| Would be much slower and eat up precious context window.
| layer8 wrote:
| The author may have missed that lexing C is actually context-
| sensitive, i.e. you need a symbol table:
| https://en.wikipedia.org/wiki/Lexer_hack
|
| Of course, for syntax highlighting this is only relevant if you
| want to highlight the multiplication operator differently from
| the dereferencing operator, or declarations differently from
| expressions.
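|
| A toy Python sketch of the ambiguity (hypothetical names; the
| typedef_names set stands in for the compiler's symbol table, and
| only statements of the form "X * Y;" are handled):

```python
def classify_star(statement, typedef_names):
    """Decide what '*' means in a statement shaped like 'a * b;'.

    typedef_names is an assumed stand-in for the symbol table: the
    set of identifiers currently declared as type names.
    """
    tokens = statement.replace("*", " * ").replace(";", " ").split()
    if len(tokens) != 3 or tokens[1] != "*":
        raise ValueError("expected a statement of the form 'a * b;'")
    if tokens[0] in typedef_names:
        return "pointer declaration"  # e.g. 'T * p;' declares p
    return "multiplication"           # e.g. 'x * y;' multiplies


# Same characters, different meaning depending on earlier typedefs.
print(classify_star("a * b;", {"a"}))
print(classify_star("a * b;", set()))
```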
|
| More generally, however, I find it useful to highlight (say)
| types differently from variables or functions, which in some
| (most?) popular languages requires full parsing and symbol table
| information. Some IDEs therefore implement two levels of syntax
| highlighting, a basic one that only requires lexical information,
| and an extended one that kicks in when full grammar and type
| information becomes available.
| legobmw99 wrote:
| I'd be shocked if jart didn't know this, but it seems unlikely
| that an LLM would generate one of these most vexing parses,
| unless explicitly asked
| layer8 wrote:
| Given all the things that were new to the author in the
| article, I wouldn't be shocked at all. There's just a huge
| number of things to know, or to have come across.
| quietbritishjim wrote:
| I think you're thinking of something different to the issue
| in the parent comment. The most vexing parse is, as the name
| suggests, a problem at the parsing stage rather than the
| earlier lexing phase. Unlike the referenced lexing problem,
| it doesn't require any hack for compilers to deal with it.
| That's because it's not really a problem for the compiler;
| it's humans that find it surprising.
| murkt wrote:
| Author hasn't tried to highlight TeX. Which is good for their
| mental health, I suppose, as it's generally impossible to fully
| highlight TeX without interpreting it.
|
| Even parsing is not enough, as it's possible to redefine what
| each character does. You can make it do things like "and now K
| means { and C means }".
|
| Yes, you can find papers on arXiv that use this god-forsaken
| feature.
| jart wrote:
| I wrote
| https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil...
| and it does a reasonable
| job highlighting without breaking for all the .tex files I
| could find on my hard drive. My goal is to hopefully cover
| 99.9% of real world usage, since that'll likely cover
| everything an LLM might output. Esoteric syntax also usually
| isn't a problem, so long as it doesn't cause strings and
| comments to extend forever, eclipsing the rest of the source
| code in a file.
| nathell wrote:
| Same with Common Lisp (you can redefine the read table),
| although that's likely abused less often on arXiv.
| bobbylarrybobby wrote:
| I couldn't believe it when I learned that \makeatletter does
| not "make (something) at a letter (character)" but rather
| "treats the '@' character as a letter when parsing".
| xonix wrote:
| No AWK?
| sundarurfriend wrote:
| The final line number count is missing Julia. Based on the file
| in the repo, it would be at the bottom of the first column:
| between ld and R.
|
| Among the niceties listed here, the one I'd wish for Julia to
| have would be C#'s "However many quotes you put on the lefthand
| side, that's what'll be used to terminate the string at the other
| end". Documentation that talks about quoting would be so much
| easier to read (in source form) with something like that.
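|
| The rule is simple enough to sketch in Python (a simplified
| illustration of the idea only, not C#'s full raw string grammar;
| the function name is made up): count the opening quotes, then look
| for the first run of the same length.

```python
def scan_raw_string(src, start=0):
    """Return (quote_count, body) for a raw string starting at src[start].

    Simplifying assumption: the literal ends at the first run of
    quote_count quotes after the opening delimiter.
    """
    n = 0
    while start + n < len(src) and src[start + n] == '"':
        n += 1
    if n < 3:
        raise ValueError("a raw string needs at least three opening quotes")
    end = src.find('"' * n, start + n)
    if end == -1:
        raise ValueError("unterminated raw string")
    return n, src[start + n:end]


# Shorter runs of quotes inside the body need no escaping.
print(scan_raw_string('"""say "hi" now"""'))
print(scan_raw_string('""""a """ b""""'))
```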
| nusaru wrote:
| > Ruby is the union of all earlier languages, and it's not even
| formally documented.
|
| It's documented, but you need $250 to spare:
| https://www.iso.org/standard/59579.html
| mdaniel wrote:
| Well, according to (ahem) _a copy_ that I found, it only goes
| up to MRI 1.9 and goes out of its way to say "welp, the world
| is changing, so we're just going to punt until Ruby stabilizes"
| which is damn cheating for a _standard_ IMHO
|
| Also, while doing some digging I found there actually are a
| number of the standards that are legitimately publicly
| available
| https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
___________________________________________________________________
(page generated 2024-11-02 23:00 UTC)