[HN Gopher] Weird Lexical Syntax
___________________________________________________________________
Weird Lexical Syntax
Author : jart
Score : 410 points
Date : 2024-11-02 07:45 UTC (1 day ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| llm_trw wrote:
| I've done a fair bit of forth and I've not seen c" used. The
| usual string printing operator is ." .
| mananaysiempre wrote:
| Counted ("Pascal") strings are rare nowadays so C" is not often
| used. Its addr len equivalent is S" and that one is fairly
| common in string manipulation code.
| kragen wrote:
| Right, _c"_ is for when you want to pass a literal string to
| some other word, not print it. But I agree that it's not very
| common, because you normally use _s"_ for that, which leaves
| the address and length on the stack, while _c"_ leaves just an
| address on the stack, pointing to a one-byte count field
| followed by the bytes. I think adding _c"_ in Forth-83 (and
| renaming _"_ to _s"_) was a mistake, and it would have been
| better to deprecate the standard words that expect or produce
| such counted strings, other than _count_ itself. See
| https://forth-standard.org/standard/alpha,
| https://forth-standard.org/standard/core/Cq,
| https://forth-standard.org/standard/core/COUNT, and
| https://forth-standard.org/standard/core/Sq.
|
| You can easily add new string and comment syntaxes to Forth,
| though. For example, you can add BCPL-style // comments to end
| of line with this line of code in, I believe, all standard
| Forths, though I've only tested it in GForth:
| : // 10 word drop ; immediate
|
| Getting it to work in block files requires more work but is
| still only a few lines of code. The standard word _\_ does
| this, and _see \_ decompiles the GForth implementation as
| : \ blk @ IF >in @ c/l / 1+ c/l * >in !
| EXIT THEN source >in ! drop ; immediate
|
| This kind of thing was commonly done for text editor commands,
| for example; you might define _i_ as a word that reads text
| until the end of the line and inserts it at the current
| position in the editor, rather than discarding it like my //
| above. Among other things, the screen editor in F83 does
| exactly that.
|
| So, as with Perl, PostScript, TeX, m4, and Lisps that support
| readmacros, you can't lex Forth without executing it.
| skrebbel wrote:
| This was a delightful read, thanks!
| croisillon wrote:
| Glad to see confirmed that PHP is the most non-weird programming
| language ;)
| rererereferred wrote:
| I recently learned php's heredoc can have space before it and
| it will remove those spaces from the lines in the string:
|     $a = <<<EOL
|         This is not indented
|             but this has 4 spaces of indentation
|         EOL;
|
| But the spaces have to match: if any line has fewer leading
| spaces than the closing EOL, it gives an error.
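For comparison, Python's `textwrap.dedent` does something similar, stripping the longest common leading whitespace from every line. It's only a rough analogue (it computes the margin itself and never errors, unlike PHP's closing-marker rule), but it illustrates the same idea:

```python
import textwrap

# dedent strips the common indentation, much like PHP strips the
# closing EOL marker's indentation from every line of the heredoc.
s = textwrap.dedent("""\
    This is not indented
        but this has 4 spaces of indentation
""")
print(s)
```

The result keeps the relative indentation: the first line ends up flush left and the second keeps its extra 4 spaces.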
| alganet wrote:
| There are two types of languages: the ones full of quirks and
| the ones no one uses.
| skitter wrote:
| Another syntax oddity (not mentioned here) that breaks most
| highlighters: In Java, unicode escapes can be anywhere, not just
| in strings. For example, the following is a valid class:
| class Foo\u007b}
|
| and this assert will not trigger:
|     assert
|         // String literals can have unicode escapes like \u000A!
|         "Hello World".equals("\u00E4");
| ivanjermakov wrote:
| I have never seen this in Java! Is there any use cases where it
| could be useful?
| susam wrote:
| I don't know about usefulness but it does let us write
| identifiers using Unicode characters. For example:
|     public class Foo {
|         public static void main(String[] args) {
|             double \u03c0 = 3.14159265;
|             System.out.println("\u03c0 = " + \u03c0);
|         }
|     }
|
| Output:
|     $ javac Foo.java && java Foo
|     π = 3.14159265
|
| Of course, nowadays we can simply write this with any decent
| editor:
|     public class Foo {
|         public static void main(String[] args) {
|             double π = 3.14159265;
|             System.out.println("π = " + π);
|         }
|     }
|
| Support for Unicode escape sequences is a result of how the
| Java Language Specification (JLS) defines InputCharacter.
| Quoting from Section 3.4 of JLS
| <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:
|     InputCharacter:
|         UnicodeInputCharacter but not CR or LF
|
| UnicodeInputCharacter is defined as the following in section
| 3.3:
|     UnicodeInputCharacter:
|         UnicodeEscape
|         RawInputCharacter
|     UnicodeEscape:
|         \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
|     UnicodeMarker:
|         u {u}
|     HexDigit: (one of)
|         0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
|     RawInputCharacter:
|         any Unicode character
|
| As a result the lexical analyser honours Unicode escape
| sequences absolutely anywhere in the program text. For
| example, this is a valid Java program:
|     public class Bar {
|         public static void \u006d\u0061\u0069\u006e(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the output:
|     $ javac Bar.java && java Bar
|     hello, world
|
| However, this is an incorrect Java program:
|     public class Baz {
|         // This comment contains \u6d.
|         public static void main(String[] args) {
|             System.out.println("hello, world");
|         }
|     }
|
| Here is the error:
|     $ javac Baz.java
|     Baz.java:2: error: illegal unicode escape
|             // This comment contains \u6d.
|                                        ^
|     1 error
|
| Yes, this is an error even if the illegal Unicode escape
| sequence occurs in a comment!
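The translation phase described above is easy to sketch. This is a minimal Python illustration (the function name is my own), including two details from JLS §3.3: an escape only counts when the backslash run before the `u` has odd length, and the `u` marker may be repeated:

```python
def decode_java_unicode(src: str) -> str:
    """Sketch of the JLS 3.3 translation phase: replace \\uXXXX escapes
    everywhere in the source, before any other lexing happens.

    An escape only counts if the backslash run before 'u' has odd
    length, and the 'u' marker may repeat (\\uu0041 is also 'A')."""
    out = []
    i = 0
    while i < len(src):
        if src[i] != '\\':
            out.append(src[i])
            i += 1
            continue
        j = i
        while j < len(src) and src[j] == '\\':
            j += 1
        run = j - i                       # length of the backslash run
        if run % 2 == 1 and j < len(src) and src[j] == 'u':
            out.append('\\' * (run - 1))  # the even prefix stays literal
            k = j
            while k < len(src) and src[k] == 'u':
                k += 1                    # UnicodeMarker: u {u}
            hex4 = src[k:k + 4]
            if len(hex4) == 4 and all(c in '0123456789abcdefABCDEF' for c in hex4):
                out.append(chr(int(hex4, 16)))
                i = k + 4
            else:
                raise SyntaxError("illegal unicode escape")
        else:
            out.append(src[i:j])          # even run: literal backslashes
            i = j
    return ''.join(out)
```

Running it on the examples above turns `class Foo\u007b}` into `class Foo{}`, and it raises the same "illegal unicode escape" error on a truncated escape even inside a comment, since this pass runs before comments exist.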
| ivanjermakov wrote:
| I wonder if full unicode range was accepted because some
| companies are writing code in non-english.
| layer8 wrote:
| Javac uses the platform encoding [0] by default to interpret
| Java source files. This means that Java source code files are
| inherently non-portable. When Java was first developed (and
| for a long time after), this was the default situation for
| any kind of plain text files. The escape sequence syntax
| allows transforming [1] Java source code into a portable
| (that is, ASCII-only) representation that is completely
| equivalent to the original, and also to convert it back to
| any platform encoding.
|
| Source control clients could apply this automatically upon
| checkin/checkout, so that clients with different platform
| encodings can work together. Alternatively, IDEs could do
| this when saving/loading Java source files. That never quite
| caught on, and the general advice was to stick to ASCII, at
| least outside comments.
|
| [0] Since JDK 18, the default encoding defaults to UTF-8.
| This probably also extends to _javac_ , though I haven't
| verified it.
|
| [1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...
| mistercow wrote:
| I also argue that failing to syntax highlight this correctly is
| a security issue. You can terminate block comments with Unicode
| escapes, so if you wanted to hide some malicious code in a Java
| source file, you just need an excuse for there to be a block of
| Unicode escapes in a comment. A dev who doesn't know about this
| quirk is likely to just skip over it, assuming it's commented
| out.
| styglian wrote:
| I once wrote a puzzle using this, which (fortunately) doesn't
| work any more, but would do interesting things on older JDK
| versions: https://pastebin.com/raw/Bh81PwXY
| mcphage wrote:
| At one point there was an open source project to formally specify
| Ruby, but I don't know if it's still alive:
| https://github.com/ruby/spec
|
| Hmm, it seems to be alive, but based more on behavior than
| syntax.
| keybored wrote:
| Meanwhile NeoVim doesn't syntax highlight my commit message
| properly if I have messed with "commit cleanup" enough.
|
| The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and the
| id starts with `#`. One suggestion was to allow backslash escapes
| in commit messages since that makes sense to a computer
| scientist.[1]
|
| But looking at all of this lexical stuff I wonder if
| makes-sense-to-computer-scientist is a good goal. They
| invented the problem
| of using a uniform delimiter for strings and then had to solve
| their own problem. Maybe it was hard to use backtick in the 70's
| and 80's, but today[2] you could use backtick to start a string
| and a single quote to end it.
|
| What do C-like programming languages use single quotes for? To
| quote characters. Why do you need to quote characters? I've never
| seen a literal character which needed an "end character" marker.
|
| Raw strings would still be useful but you wouldn't need raw
| strings just to do a very basic thing like make a string which
| has typewriter quotes in it.
|
| Of course this was for C-like languages. Don't even get me
| started on shell and related languages where basically everything
| is a string and you have to make a single-quote/double-quote
| battle plan before doing anything slightly nested.
|
| [1] https://lore.kernel.org/git/vpq3808p40o.fsf@anie.imag.fr/
|
| [2] Notwithstanding us Europeans that use a dead-key keyboard
| layout where you have to type twice to get one measly backtick
| (not that I use those)
| pwdisswordfishz wrote:
| > The comment character in Git commit messages can be a problem
| when you insist on prepending your commits with some "id" and
| the id starts with `#`
|
| https://git-scm.com/docs/git-commit#Documentation/git-commit...
| keybored wrote:
| See "commit cleanup".
|
| There's surprising layers to this. That the reporter in that
| thread says that git-commit will "happily" accept `#` in
| commit messages is half-true: it will accept it if you don't
| edit the message since the `default` cleanup (that you linked
| to) will not remove comments if the message is given through
| things like `-m` and not an editing session. So `git commit
| -m'#something'` is fine. But then try to do rebase and cherry-
| pick and whatever else later, and maybe you get a merge commit
| message with a comment listing the "conflicted" files. Well,
| it can get confusing.
| kragen wrote:
| > _Maybe it was hard to use backtick in the 70's and 80's, but
| today[2] you could use backtick to start a string and a single
| quote to end it._
|
| That's how quoting works by default in m4 and TeX, both defined
| in the 70s. Unfortunately Unicode retconned the ASCII
| apostrophe character ' to be a vertical line, maybe out of a
| misguided deference to Microsoft Windows, and now we all have
| to suffer the consequences. (Unless we're using Computer Modern
| fonts or other fonts that predate this error, such as VGA font
| ROM dumps.)
|
| In the 70s and 80s, and into the current millennium on Unix,
| `x' did look like 'x', but now instead it looks like dogshit.
| Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
| Unfortunately, following the Unicode botch theme, the standard
| failed to define an endianness or minimum width for them, so
| they're not very useful today. You can use them as enum values
| if you want to make your memory dumps easier to read in the
| debugger, and that's about it. I think Microsoft's compiler
| botched them so badly that even that's not an option if you
| need your code to run on it.
| ygra wrote:
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line
|
| Unicode does not prescribe the appearance of characters.
| Although the code chart [1] says >>neutral (vertical)
| glyph with mixed usage<< (next to >>apostrophe-quote<< and
| >>single quote<<), font vendors have to deal with this mixed
| usage. And with Unicode the correct quotation marks have
| their own code points, making it unnecessary to design fonts
| where the ASCII apostrophe takes their form, but rendering
| all other uses pretty ugly.
|
| I would regard using ` and ' as paired quotation marks as a
| hack from times when typographic expression was simply not
| possible with the character sets of the day.
|
| _________
|
| [1] 0027 ' APOSTROPHE
|     = apostrophe-quote (1.0)
|     = single quote
|     = APL quote
|     * neutral (vertical) glyph with mixed usage
|     * 2019 ' is preferred for apostrophe
|     * preferred characters in English for paired quotation
|       marks are 2018 ' & 2019 '
|     * 05F3 ' is preferred for geresh when writing Hebrew
|     - 02B9 ' modifier letter prime
|     - 02BC ' modifier letter apostrophe
|     - 02C8 ' modifier letter vertical line
|     - 0301 $ combining acute accent
|     - 030D $ combining vertical line above
|     - 05F3 ' hebrew punctuation geresh
|     - 2018 ' left single quotation mark
|     - 2019 ' right single quotation mark
|     - 2032 ' prime
|     - A78C latin small letter saltillo
| keybored wrote:
| > That's how quoting works by default in m4 and TeX, both
| defined in the 70s.
|
| Good point. And it was in m4[1] I saw that
| backtick+apostrophe syntax. I would have probably not thought
| of that possibility if I hadn't seen it there.
|
| [1] Probably on Wikipedia since I have never used it
|
| > Unfortunately Unicode retconned the ASCII apostrophe
| character ' to be a vertical line, maybe out of a misguided
| deference to Microsoft Windows, and now we all have to suffer
| the consequences. (Unless we're using Computer Modern fonts
| or other fonts that predate this error, such as VGA font ROM
| dumps.)
|
| I do think the vertical line looks subpar (and I don't use it
| in prose). But most programmers don't seem bothered by it. :|
|
| > In the 70s and 80s, and into the current millennium on
| Unix, `x' did look like 'x', but now instead it looks like
| dogshit.
|
| Emacs tries to render it like 'x' since it uses
| backtick+apostrophe for quotes. With some mixed results in my
| experience.
|
| > Even if you are willing to require a custom font for
| readability, though, that doesn't solve the problem; you need
| some way to include an apostrophe in your quoted string!
|
| Aha, I honestly didn't even think that far. Seems a bit
| restrictive to not be able to use possessives and
| contractions in strings without escapes.
|
| > As for end delimiters, C itself supports multicharacter
| literals, which are potentially useful for things like
| Macintosh type and creator codes, or FTP commands.
|
| I should have made it clear that I was only considering
| C-likes and not C itself. A language from the C trigraph days
| can be excused. To a certain extent.
| kragen wrote:
| I'd forgotten about `' in Emacs documentation! That may be
| influenced by TeX.
|
| C multicharacter literals are unrelated to trigraphs.
| Trigraphs were a mistake added many years later in the ANSI
| process.
| tom_ wrote:
| See also: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
| kragen wrote:
| This is an excellent document. I disagree with its
| normative conclusions, because I think being incompatible
| with ASCII, Unix, Emacs, and TeX is worse than being
| incompatible with ISO-8859-1, Microsoft Windows, and MacOS
| 9, but it is an excellent reference for the factual
| background.
| shawa_a_a wrote:
| The comment character is also configurable:
| git config core.commentchar <char>
|
| This is helpful where you want to use, say, Markdown to have
| tidily formatted commit messages make up your pull request
| body too.
| keybored wrote:
| I want to try to set it to `auto` and see what spicy things
| it comes up with.
| samatman wrote:
| There are no problems caused by using unary delimiters for
| strings, because using paired delimiters for strings doesn't
| solve the problems unary delimiters create.
|
| By nature, strings contain arbitrary text. Paired delimiters
| have one virtue over unary: they nest, but this virtue is only
| evident when a syntax requires that they _must_ nest, and this
| is not the case for strings. It's but a small victory to
| reduce the need for some sort of escaping, without eliminating
| it.
|
| Of the bewildering variety of partial solutions to the dilemma,
| none fully satisfactory, I consider the `backtick quote'
| pairing among the worst. Aside from the aesthetic problems,
| which can be fixed with the right choice of font, the bare
| apostrophe is much more common in plain text than an unmatched
| double quote, and the convention does nothing to help.
|
| This comes at the cost of losing a type of string, and backtick
| strings are well-used in many languages, including by you in
| your second paragraph. What we would get in return for this
| loss is, nothing, because `don't' is just as invalid as 'don't'
| and requires much the same solution. `This is `not worth it',
| you see', especially as languages like to treat strings as
| single tokens (many exceptions notwithstanding) and this
| introduces a push-down to that parse for, again, no appreciable
| benefit.
|
| I do agree with you about C and character literals, however.
| The close quote isn't needed and always struck me as somewhat
| wasteful. 'a is cleaner, and reduces the odds of typing "a"
| when you mean 'a'.
| yen223 wrote:
| select'select'select
|
| is a perfectly valid SQL query, at least for Postgres.
|
| Languages' approaches to whitespace between tokens are all
| over the place.
| notsylver wrote:
| As soon as I saw this was part of llamafile I was hoping that it
| would be used to limit LLM output to always be "valid" code as
| soon as it saw the backticks, but I suppose most LLMs don't have
| problems with that anyway. And I'm not sure you'd want something
| like that automatically forcing valid code anyway
| dilap wrote:
| llama.cpp does support something like this -- you can give it a
| grammar which restricts the set of available next tokens that
| are sampled over
|
| so in theory you could notice "```python" or whatever and then
| start restricting to valid python code. (in least in theory,
| not sure how feasible/possible it would be in practice w/ their
| grammar format.)
|
| for code i'm not sure how useful it would be since likely any
| model that is giving you working code wouldn't be struggling w/
| syntax errors anyway?
|
| but i have had success experimentally using the feature to
| drive fiction content for a game from a smaller llm to be in a
| very specific format.
| notsylver wrote:
| yeah, ive used llama.cpp grammars before, which is why i was
| thinking about it. i just think it'd be cool for llamafile to
| do basically that, but with included defaults so you could
| eg, require JSON output. it could be cool for prototyping or
| something. but i dont think that would be too useful anyway,
| most of the time i think you would want to restrict it to a
| specific schema, so i can only see it being useful for
| something like a tiny local LLM for code completion, but that
| would just encourage valid-looking but incorrect code.
|
| i think i just like the idea of restricting LLM output, it
| has a lot of interesting use cases
| dilap wrote:
| gotchya. i do think that is a cool idea actually -- LLMs
| tiny enough to do useful things with formally structured
| output but not big enough to nail the structure ~100% is
| probably not an empty set.
| pwdisswordfishz wrote:
| > Of all the languages, I've saved the best for last, which is
| Ruby. Now here's a language whose syntax evades all attempts at
| understanding.
|
| TeX with its arbitrarily reprogrammable lexer: how adorable
| fanf2 wrote:
| Lisp reader macros allow you to program its lexer too.
| skydhash wrote:
| You can basically define a new language with a few lines of
| code in Racket.
| pansa2 wrote:
| > _TypeScript, Swift, Kotlin, and Scala take string interpolation
| to the furthest extreme of encouraging actual code being embedded
| inside strings. So to highlight a string, one must count curly
| brackets and maintain a stack of parser states._
|
| Presumably this is also true in Python - IIRC the brace-delimited
| fields within f-strings may contain arbitrary expressions.
|
| More generally, this must mean that the lexical grammar of those
| languages isn't regular. "Maintaining a stack" isn't part of a
| finite-state machine for a regular grammar - instead we're in the
| realm of pushdown automata and context-free grammars.
|
| Is it even possible to support generalized string interpolation
| within a strictly regular lexical grammar?
| aphantastic wrote:
| > Is it even possible to support generalized string
| interpolation within a strictly regular lexical grammar?
|
| Almost certainly not, a fun exercise is to attempt to devise a
| Pumping tactic for your proposed language. If it doesn't exist,
| it's not regular.
|
| https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...
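A quick way to see the limitation concretely (illustrative Python, not a real f-string lexer): a regular expression can be written for any *fixed* nesting depth, but each extra level needs a strictly larger pattern, and no single regular pattern covers unbounded nesting — which is what a pumping argument makes precise:

```python
import re

# Matches f-string-shaped literals whose {...} fields contain no
# further braces, i.e. nesting depth at most 1. Any fixed depth can
# be handled this way; unbounded depth cannot.
DEPTH1 = re.compile(r'f"(?:[^"{}]|\{[^{}]*\})*"')

assert DEPTH1.fullmatch('f"a {x} b {y} c"')       # depth 1: accepted
assert not DEPTH1.fullmatch('f"a {f"n {y}"} b"')  # nested: rejected
```

To accept one more level you would have to splice another copy of the brace alternative inside itself, and so on forever — a finite automaton has no stack to count with.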
| fanf2 wrote:
| Complicated interpolation can be lexed as a regular language if
| you treat strings as three separate lexical things, eg in
| JavaScript template literals there are,
| `stuff${ }stuff${ }stuff`
|
| so the ${ and } are extra closing and opening string
| delimiters, leaving the nesting to be handled by the parser.
|
| You need a lexer hack so that the lexer does not treat } as the
| start of a string literal, except when the parser is inside an
| interpolation but all nested {} have been closed.
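The scheme can be sketched in Python as a toy lexer (only identifiers, braces, and template literals are handled here; the TemplateHead/Middle/Tail token names follow the ECMAScript grammar, the rest is simplified):

```python
def lex(src):
    """Toy lexer for a JS-like fragment with template literals.
    depth_stack implements the "lexer hack": a `}` only resumes a
    template when the innermost interpolation's nested {} are closed."""
    tokens = []
    depth_stack = []   # one entry per open interpolation: its {} depth
    i = 0
    while i < len(src):
        c = src[i]
        if c == '`':
            # Scan string text until `${` (head token) or closing backtick.
            j = i + 1
            while src[j] != '`' and src[j:j + 2] != '${':
                j += 1
            if src[j] == '`':
                tokens.append(('NoSubstitutionTemplate', src[i:j + 1]))
                i = j + 1
            else:
                tokens.append(('TemplateHead', src[i:j + 2]))
                depth_stack.append(0)
                i = j + 2
        elif c == '}' and depth_stack and depth_stack[-1] == 0:
            # This `}` re-enters the template: scan for `${` or backtick.
            j = i + 1
            while src[j] != '`' and src[j:j + 2] != '${':
                j += 1
            if src[j] == '`':
                tokens.append(('TemplateTail', src[i:j + 1]))
                depth_stack.pop()
                i = j + 1
            else:
                tokens.append(('TemplateMiddle', src[i:j + 2]))
                i = j + 2
        elif c in '{}':
            # Plain braces inside an interpolation adjust the depth counter.
            if depth_stack:
                depth_stack[-1] += 1 if c == '{' else -1
            tokens.append(('Punct', c))
            i += 1
        elif c.isspace():
            i += 1
        else:
            j = i + 1
            while j < len(src) and not src[j].isspace() and src[j] not in '`{}':
                j += 1
            tokens.append(('Ident', src[i:j]))
            i = j
    return tokens
```

For ``lex('`a${ x }b${ y }c`')`` this yields exactly the three string-piece tokens described above, with the expressions between them lexed normally.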
| irdc wrote:
| I'd be interested to see a re-usable implementation of joe's[0]
| syntax highlighting.[1] The format is powerful enough to allow
| for the proper highlighting of Python f-strings.[2]
|
| 0. https://joe-editor.sf.net/
|
| 1. https://github.com/cmur2/joe-syntax/blob/joe-4.4/misc/HowItW...
|
| 2. https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...
| pama wrote:
| Interestingly, python f-strings changed their syntax at version
| 3.12, so highlighting should depend on the version.
| irdc wrote:
| It's just that nesting them arbitrarily is now allowed,
| right? That shouldn't matter much for a mere syntax
| highlighter then. And one could even argue that code that
| relies on this too much is not really for human consumption.
| pansa2 wrote:
| Also, you can now use the same quote character that
| encloses an f-string within the {} expressions. That could
| make them harder to tokenize, because it makes it harder to
| recognise the end of the string.
| akira2501 wrote:
| I've actually made several lexers and parsers based on the joe
| DFA style of parsing. The state and transition syntax was
| something that I always understood much more easily than the
| standard tools.
|
| The downside is your rulesets tend to get more verbose and are
| a little bit harder to structure than they might ideally be in
| other languages more suited towards the purpose, but I actually
| think that's an advantage, as it's much easier to reason about
| every production rule when looking at the code.
| rererereferred wrote:
| In the C# multiquoted strings, how does it know that this:
|     Console.WriteLine("""""");
|     Console.WriteLine("""""");
|
| are 2 triplequoted empty strings and not one
| "\nConsole.WriteLine(" sixtuplequoted string?
| ygra wrote:
| The former, I'd say.
|
| https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
|
| For a multi-line string the quotes have to be on their own
| line.
| Joker_vD wrote:
| If the opening quotes are followed by anything that is not a
| whitespace before the next new-line (or EOF), then it's a
| single-line string.
|
| I imagine implementing those things took several iterations :)
| yen223 wrote:
| It's a syntax error!
|     Unterminated raw string literal.
|
| https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...
| Joker_vD wrote:
| Ah, so there is no backtracking in lexer for this case. Makes
| sense.
| ygra wrote:
| As for C#'s triple-quoted strings, they actually came from Java
| before and C# ended up adopting the same or almost the same
| semantics. Including stripping leading whitespace.
| pdw wrote:
| Some random things that the author seem to have missed:
|
| > but TypeScript, Swift, Kotlin, and Scala take string
| interpolation to the furthest extreme of encouraging actual code
| being embedded inside strings
|
| Many more languages support that:
|     C#          $"{x} plus {y} equals {x + y}"
|     Python      f"{x} plus {y} equals {x + y}"
|     JavaScript  `${x} plus ${y} equals ${x + y}`
|     Ruby        "#{x} plus #{y} equals #{x + y}"
|     Shell       "$x plus $y equals $(echo "$x+$y" | bc)"
|     Make :)     echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
|
| > Tcl
|
| Tcl is funny because comments are only recognized in code, and
| since it's homoiconic, it's very hard to distinguish code and
| data. { } are just funny string delimiters. E.g.:
|     xyzzy {#hello world}
|
| Is xyzzy a command that takes a code block or a string? There's
| no way to tell. (Yes, that means that the Tcl tokenizer/parser
| cannot discard comments: only at evaluation time it's possible to
| tell if something is a comment or not.)
|
| > SQL
|
| PostgreSQL has the very convenient dollar-quoted strings:
| https://www.postgresql.org/docs/current/sql-syntax-lexical.h...
| E.g. these are equivalent:
|     'Dianne''s horse'
|     $$Dianne's horse$$
|     $SomeTag$Dianne's horse$SomeTag$
| autarch wrote:
| Perl lets you do this too:
|     my $foo = 5;
|     my $bar = 'x';
|     my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
|     print "$quux\n";
|
| This prints out:
|     I have 5 x's: xxxxx
|
| The "@{[...]}" syntax is abusing Perl's ability to interpolate
| an _array_ as well as a scalar. The inner "[...]" creates an
| array reference and the outer "@{...}" dereferences it.
|
| For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates the
| array reference.
| Izkata wrote:
| > For reasons I don't remember, the Perl interpreter allows
| arbitrary code in the inner "[...]" expression that creates
| the array reference.
|
| ...because it's an array value? Aside from how the languages
| handle references, how is that part any different from, for
| example, this in python:
|     >>> [5 * 'x']
|     ['xxxxx']
|
| You can put (almost) anything there, as long as it's an
| expression that evaluates to a value. The resulting value is
| what goes into the array.
| autarch wrote:
| I understand that's constructing an array. What's a bit odd
| is that the interpreter allows you to string interpolate
| any expression when constructing the array reference inside
| the string.
| Izkata wrote:
| It's not...? Well, not directly: It's string
| interpolating an array of values, and the array is
| constructed using values from the results of expressions.
| These are separate features that compose nicely.
| JadeNB wrote:
| > What's a bit odd is that the interpreter allows you to
| string interpolate any expression when constructing the
| array reference inside the string.
|
| Why? Surely it is easier for both the language and the
| programmer to have a rule for what you can do when
| constructing references to anonymous arrays, without
| having to special case whether that anonymous array is or
| is not in a string (or in any one of the many other
| contexts in which such a construct may appear in Perl).
| weinzierl wrote:
| You also don't need quotes around strings (barewords). So
| my $bar = x;
|
| should give the same result.
|
| Good luck with lexing that properly.
|
| https://perlmaven.com/barewords-in-perl
| shawn_w wrote:
| If you're writing anything approaching decent perl that
| won't be accepted.
| emmelaich wrote:
| "use strict" will prevent it and I think strict will be
| assumed/default soon.
| JadeNB wrote:
| As of Perl 5.12, `use`ing a version (necessary to ensure
| availability of some of the newer features) automatically
| implies `use strict`.
|
| https://perldoc.perl.org/strict#HISTORY
| weinzierl wrote:
| Doesn't really matter for a syntax highlighter, because
| it is out of your control what you get. For the llamafile
| highlighter even more so since it supports other legacy
| quirks, like C trigraphs as well.
| layer8 wrote:
| > actual code being embedded inside strings
|
| My view on this is that it shouldn't be interpreted as code
| being embedded inside strings, but as a special form of string
| concatenation syntax. In turn, this would mean that you can
| nest the syntax, for example:
|     "foo { toUpper("bar { x + y } bar") } foo"
|
| The individual tokens being (one per line):
|     "foo {
|     toUpper
|     (
|     "bar {
|     x
|     +
|     y
|     } bar"
|     )
|     } foo"
|
| If `+` does string concatenation, the above would effectively
| be equivalent to:
|     "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
|
| I don't know if there is a language that actually works that
| way.
| panzi wrote:
| Indeed in some of the listed languages you can nest it like
| that, but in others (e.g. Python) you can't. I would guess
| they deliberately don't want to enable that and it's not a
| problem in their parser or something.
| layer8 wrote:
| Even when nesting is disallowed, my point is that I find it
| preferable to not view it (and syntax-highlight it) as a
| "special string" with embedded magic, but as multiple
| string literals with just different delimiters that allow
| omitting the explicit concatenation operator, and normal
| expressions interspersed in between. I think it's important
| to realize that it is really just very simple syntactic
| sugar for normal string concatenation.
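Python's own compiler takes roughly this view: an f-string parses to a JoinedStr node, i.e. a sequence of plain string constants alternating with interpolated expressions — essentially implicit concatenation. A quick check with the stdlib ast module:

```python
import ast

# An f-string becomes JoinedStr: Constant pieces alternating with
# FormattedValue expressions, not a "string with magic inside".
node = ast.parse('f"foo {x + y} bar"', mode="eval").body
assert isinstance(node, ast.JoinedStr)
print([type(part).__name__ for part in node.values])
# ['Constant', 'FormattedValue', 'Constant']
```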
| Timwi wrote:
| While you're conceptually right, in practice I think it
| bears mentioning that in C# the two syntaxes compile
| differently. This is because C#'s target platform, the
| .NET Framework, has always had a function called
| `string.Format` that lets you write this:
|     var str = string.Format("{0} is {1} years old.", name, age);
|
| When interpolated strings were introduced later, it was
| natural to have them compile to this instead of
| concatenation.
| layer8 wrote:
| There's no reason in principle why
|     name + " is " + age + " years old."
|
| couldn't compile to exactly the same. (Other than maybe
| `string.Format` having some additional customizable
| behavior, I don't know C# that well.)
| epcoa wrote:
| Like Python, and Rust with the format! macro (which
| doesn't even support arbitrary expressions), in C# the full
| syntax for interpolated/formatted strings is this:
|     {<interpolationExpression>[,<alignment>][:<formatString>]}
|
| i.e. there is more going on than just a simple wrapper around
| concat or StringBuilder.
| ygra wrote:
| When not using the format specifiers or alignment it will
| indeed compile to just string.Concat (which is also what
| the + operator for strings compiles to). Similar to C
| compilers choosing to call puts instead of printf if
| there is nothing to be formatted.
| epcoa wrote:
| If it's treated strictly as simple concatenation
| syntactic sugar then you are allowing something like
|     print("foo { func() );
|
| which seems janky af.
|
| > just very simple syntactic sugar for normal string
| concatenation.
|
| Maybe. There's also possibly a string conversion. It
| seems reasonable to want to disallow implicit string
| conversion in a concatenation operator context
| (especially if overloading +) while allowing it in the
| interpolation case.
| layer8 wrote:
| I failed to mention the balancing requirement, that
| should of course remain. But it's an artificial
| requirement, so to speak, that is merely there to double-
| check the programmer's intent. The compiler/parser
| wouldn't actually care (unlike for an arithmetic
| expression with unbalanced parentheses, or scope blocks
| with unbalanced braces), the condition is only checked
| for the programmer's benefit.
|
| > There's also possibly a string conversion. It seems
| reasonable to want to disallow implicit string conversion
| in a concatenation operator context (especially if
| overloading +) while allowing it in the interpolation
| case.
|
| Many languages have a string concatenation operator that
| does implicit conversion to string, while still having a
| string interpolation syntax like the above. It's kind of
| my point that both are much more similar to each other
| than many people seem to realize.
| Tarean wrote:
| As of python 3.6 you can nest fstrings. Not all formatters
| and highlighters have caught up, though.
|
| Which is fun, because correct highlighting depends on
| language version. Haskell has similar problems where
| different compiler flags require different parsers. Close
| enough is sufficient for syntax highlighting, though.
|
| Python is also a bit weird because it calls the format
| methods, so objects can intercept and react to the format
| specifiers in the f-string while being formatted.
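For instance, a class can implement `__format__` and receive whatever follows the `:` in the replacement field verbatim (an illustrative sketch; the `Temperature` class and its "F" spec are made up):

```python
class Temperature:
    """Object that inspects the format spec an f-string passes it."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Whatever follows ':' in the replacement field arrives here verbatim.
        if spec == "F":
            return f"{self.celsius * 9 / 5 + 32:.1f}F"
        return f"{self.celsius:.1f}C"

t = Temperature(20)
print(f"{t}")    # no spec -> __format__(t, "") -> 20.0C
print(f"{t:F}")  # spec "F" -> Fahrenheit rendering -> 68.0F
```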
| panzi wrote:
| I didn't mean nested f-strings. I mean this is a syntax
| error: >>> print(f"foo {"bar"}")
| SyntaxError: f-string: expecting '}'
|
| Only this works: >>> print(f"foo
| {'bar'}") foo bar
| pdw wrote:
| You're using an old Python version. On recent versions,
| it's perfectly fine: Python 3.12.7
| (main, Oct 3 2024, 15:15:22) [GCC 14.2.0] on linux
| Type "help", "copyright", "credits" or "license" for more
| information. >>> print(f"foo {"bar"}")
| foo bar
| epcoa wrote:
| > "foo { ...
|
| That should probably not be one token.
|
| > My view on this is that it shouldn't be interpreted as code
| being embedded inside strings
|
| I'm not sure exactly what you're proposing and how it is
| different. You still can't parse it as a regular lexical
| grammar.
|
| How does this change how you highlight either?
|
| Whatever you call it, to the lexer it is a special string, it
| has to know how to match it, the delimiters are materially
| different than concatenation.
|
| I might be being dense but I'm not sure what's formally
| distinct.
| layer8 wrote:
| > > "foo { ...
|
| > That should probably not be one token.
|
| It's exactly the point that this is one token. It's a
| string literal with opening delimiter `"` and closing
| delimiter `{`, and that whole token itself serves as a kind
| of opening "brace". Alternatively, you can see `{` as a
| contraction of `" +`. Meaning, aside from the brace
| balancing requirement, `"foo {` does the same as `"foo " +`
| would.
|
| Still alternatively, you could imagine a language that
| concatenates around string literals by default, similar to
| how C behaves for sequences of string literals. In C,
| "foo" "bar" "baz"
|
| is equivalent to "foobarbaz"
|
| Similarly, you could imagine a language where
| "foo" some_variable "bar"
|
| would perform implicit concatenation, without needing an
| explicit operator (as in `"foo" + x + "bar"`). And then
| people might write it without the inner whitespace, as:
| "foo"some_variable"bar"
|
| My point is that "foo{some_variable}bar"
|
| is really just that (plus a condition requiring balanced
| pairs of braces). You can also re-insert the spaces for
| emphasis: "foo{ some_variable }bar"
|
| The fact that people tend to think of `{some_variable}` as
| an entity is sort-of an illusion.
|
| > How does this change how you highlight either?
|
| You would highlight the `"...{`, `}...{`, and `}..."` parts
| like normal string literals (they just use curly braces
| instead of double quotes at one or both ends), and
| highlight the inner expressions the same as if they weren't
| surrounded by such literals.
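One way to make this view concrete is a toy lexer (for a hypothetical language, not any real one) that emits `"...{`, `}...{`, and `}..."` chunks as single string-part tokens, with the expressions between them handed off separately:

```python
def lex_interpolated(src):
    """Split an interpolated string literal like "foo{x}bar" into tokens.

    Each string part (including its '"'/'{'/'}' delimiters) becomes one
    token; expression parts are returned separately.  A sketch only: no
    nesting, no escapes, no error recovery.
    """
    assert src[0] == '"' and src[-1] == '"'
    tokens = []
    i = 0
    while i < len(src) - 1:
        # A string-part token runs from the current opener to the next '{' or '"'.
        j = i + 1
        while src[j] not in '{"':
            j += 1
        tokens.append(("STR_PART", src[i:j + 1]))
        if src[j] == '"':
            break
        # Expression part: runs to the (flat) matching '}'.
        k = src.index("}", j + 1)
        tokens.append(("EXPR", src[j + 1:k]))
        i = k  # the '}' becomes the opener of the next string part
    return tokens

print(lex_interpolated('"foo{some_variable}bar"'))
```

Balancing the braces is then a separate well-formedness check over the token stream, just as described above.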
| epcoa wrote:
| > It's exactly the point that this is one token.
|
| Fair enough. The point, as you have acknowledged, being
| that unlike + you have to treat { specially for balancing
| (and separately from the ").
|
| > The fact that people tend to think of `{some_variable}`
| as an entity is sort-of an illusion.
|
| I guess. I just don't know what being an illusion means
| formally. It's not an illusion to the person that has to
| implement the state machine that balances the delimiters.
|
| > You would highlight the `"...{`, `}...{`, and `}..."`
| parts like normal string literals (they just use curly
| braces instead of double quotes at one or both ends), and
| highlight the inner expressions the same as if they
| weren't surrounded by such literals
|
| Emacs does it this way FWIW. But I'm not sure how
| important it is to dictate that the brace can't be a
| different color.
|
| In any event, I can agree your design is valid (Kotlin
| works this way), but I don't necessarily agree it is any
| more valid than, say, how Python does it, where there can
| be format specifiers and implicit conversion to string is
| performed, whereas it is not with concatenation. I'm not
| seeing the clear definitive advantage of interpolated strings
| being an equivalent to concatenation vs some other type
| of method call.
|
| The other detail is order of evaluation or sequencing.
| String concat may behave differently. Not sure I agree it
| is wrong, because at the end of the day it is distinct
| looking syntax. Illusion or not, it looks like a neatly
| enclosed expression, and concatenation looks like
| something else. That they might parse, evaluate, or behave
| differently isn't unreasonable.
| panzi wrote:
| Is this a bash-ism? "$x plus $y equals
| $((x+y))"
| jonahx wrote:
| This works in "sh" as well for me.
| panzi wrote:
| On some systems (like on mine) sh is just a link to bash,
| so I couldn't test it.
| Izkata wrote:
| Isn't bash supposed to act like sh when executed with
| that name?
| saagarjha wrote:
| It still has bashisms
| jwilk wrote:
| No, it's portable shell syntax.
| LukeShu wrote:
| "$((" arithmetic expansion is POSIX (XCU 2.6.4 "Arithmetic
| Expansion").
|
| But if I'm not mistaken, it originated in ksh.
| susam wrote:
| > Is this a bash-ism?
|
| > "$x plus $y equals $((x+y))"
|
| No, it is specified in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...
| therein wrote:
| > PostgreSQL has the very convenient dollar-quoted strings
|
| I did not know that. Today I learned.
| sundarurfriend wrote:
| > Many more languages support that:
|
| Julia as well:         "$x plus $y equals $(x+y)"
| thesz wrote:
| VHDL
|
| There is a record constructor syntax in VHDL using attribute
| invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr).
| This means that if your record has a first field a subtype of a
| character type, you can get record construction expression like
| this one: REC'('0',1,"10101").
|
| Good luck distinguishing, at the lexical level, between '('
| as a character literal and the token sequence "'", "(",
| "'0'".
|
| Haskell.
|
| Haskell has context-free syntax for bracketed ("{-" ... "-}")
| comments. The lexer has to keep bracketed comment syntax
| balanced (for every "{-" there should be an accompanying
| "-}" somewhere).
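Balanced comments mean the lexer needs a counter rather than a plain regular expression. A sketch of the depth-tracking scan (in Python, using Haskell's delimiters):

```python
def skip_nested_comment(src, i):
    """Given that src[i:] starts with '{-', return the index just past
    the matching '-}', honouring nesting.  Raises if unterminated."""
    assert src[i:i + 2] == "{-"
    depth = 0
    while i < len(src):
        if src[i:i + 2] == "{-":
            depth += 1      # another open bracket: go one level deeper
            i += 2
        elif src[i:i + 2] == "-}":
            depth -= 1      # close bracket: come back up one level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated block comment")

src = "{- outer {- inner -} still outer -}x = 1"
end = skip_nested_comment(src, 0)
print(src[end:])  # -> x = 1
```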
| 1vuio0pswjnm7 wrote:
| Shell "$x plus $y equals $((x+y))"
|
| Shell "$x plus $y equals $(expr $x + $y)"
| Izkata wrote:
| Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y"
| | bc)"
|
| I'm guessing this is the reason for the :) but to be clear for
| anyone else: Make is only doing half of the work, whatever
| comes after "shell" is being passed to another executable, then
| make captures its stdout and interpolates that. The other
| executable is "sh" by default but can be changed to whatever.
| mbo wrote:
| > Scala
|
| A note about Scala's string interpolation: interpolated
| strings can be used as pattern match targets.
|         val s"${a} + ${b}" = "1 + 2"
|         println(a) // 1
|         println(b) // 2
| vidarh wrote:
| Ruby takes this to 100. As much as I love Ruby, this is
| valid Ruby, and I can't defend this:
|         puts "This is #{<<HERE.strip} evil"
|         incredibly
|         HERE
|
| Just to combine the string interpolation with her concern over
| Ruby heredocs.
|
| My other favorite evil quirk in Ruby is that whitespace is a
| valid quote character. The string "% hello " (quotes not
| included) is a quoted string containing "hello", as "%" in
| contexts where there is no left operand initiates a quoted
| string and the next character indicates the type of quotes.
| This is great when you do e.g. "%(this is a string)" or
| "%{this is a string}". It's not so great if you use space
| (I've _never_ seen that in the wild, so it'd be nice if it
| was just removed - even irb doesn't handle it correctly).
| jart wrote:
| https://pbs.twimg.com/media/GbEfj6fbQAQRUB7?format=png&name=...
|
| That's so going in the blog post later today.
| vidarh wrote:
| Heh. I love Ruby, but, yes, the parser is "interesting",
| for values of interesting left undefined for its high
| obscenity content.
| mdaniel wrote:
| And don't overlook the fact that the bare-word heredoc, or
| its "HERE" friend, is still in an interpolation context, so...
|         puts "hello #{<<onoz.strip} world"
|         recursion is #{<<onoz.strip}
|         recursive
|         onoz
|         onoz
|         puts "that was fun"
|
| yields
|         hello recursion is recursive world
|         that was fun
|
| and then there's its backtick friend
|         puts "hello #{<<`onoz`.strip} world"
|         date -u
|         onoz
|
| coughs up
|         hello Sun Nov 3 17:25:32 UTC 2024 world
|
| and for those trying out your percent-space trick, be aware
| that it only tolerates such a thing in a standalone
| expression context, so
|         puts (% hello )+" world"
|         # or
|         x = % hello 
|         puts x
|
| because when I tried it "normally" I got
|         $ /usr/bin/ruby -e 'puts % hello + "world"'
|         -e:1:in `<main>': undefined local variable or method
|         `hello' for main:Object (NameError)
|         $ /usr/bin/ruby -v
|         ruby 2.6.10p210 (2022-04-12 revision 67958)
|         [universal.x86_64-darwin21]
|
| but, at the intersection is "ruby parsing is the 15th circle
| of hell"
|         ruby -e 'puts (% #{<<FOO.strip} )+ " world"
|         hello
|         FOO
|         '
| cryptonector wrote:
| jq: "\("hello" + "world")!!"
|
| I wish PG had dollar-bracket quoting where you have to use the
| closing bracket to close, that way vim showmatch would work
| trivially. Something like ${...}$.
| bastawhiz wrote:
| Python f-strings are kind of wild. They can even contain
| comments! They also have slightly different rules for parsing
| certain kinds of expressions, like := and lambdas. And until
| fairly recently, strings inside the expressions couldn't use
| the quote type of the f-string itself (or backslashes).
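A few of those parsing quirks, runnable on any recent Python (the comment-in-f-string and matching-quote forms need 3.12's PEP 701, so they are only mentioned in comments here):

```python
# The walrus operator must be parenthesized in a replacement field,
# since a trailing '=' would otherwise be read as the debug specifier.
print(f"{(x := 10)}")

# Lambdas need parentheses too: a bare ':' would start a format spec.
print(f"{(lambda v: v * 2)(x)}")

# Before Python 3.12 (PEP 701), the inner quotes had to differ from
# the outer ones, and backslashes were banned inside the braces.
d = {"key": "value"}
print(f"{d['key']}")
```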
| __MatrixMan__ wrote:
| This was a fun read, but it left me a bit more sympathetic to the
| lisp perspective, which (if I've understood it) is that syntax,
| being not an especially important part of a language, is more of
| a hurdle than a help, and should be as simple and uniform as
| possible so we can focus on other things.
|
| Which is sort of ironic because learning how to do structural
| editing on lisps has absolutely been more hurdle than help so
| far, but I'm sure it'll pay off eventually.
| mqus wrote:
| Having a simple syntax might be fine for computers but syntax
| is mainly designed to be read and written by humans. Having a
| simple one like lisp then just turns syntactic discussions
| into semantic ones, shifting the problem between layers.
|
| And I think a complex syntax is far easier to read and write
| than a simple syntax with complex semantics. You also get a
| faster feedback loop in case the syntax of your code is wrong
| vs the semantics (which might be undiscovered until runtime).
| drewr wrote:
| I don't understand your distinction between syntax and
| semantics. If the semantics are complex, wouldn't that mean
| the syntax is thus complex?
| SuperCuber wrote:
| lisp's syntax is simple - it's just parentheses to define a
| list, and the first element of a list is executed as a
| function.
|
| but for example a language like C has many different
| syntaxes for different operations, like function
| declaration or variable or array syntax, or if/switch-case
| etc etc.
|
| so to know C syntax you need to learn all these different
| ways to do different things, but in lisp you just need to
| know how to match parentheses.
|
| But of course you still want to declare variables, or have
| if/else and switch case. So you instead need to learn the
| builtin macros (what GP means by semantics) and their
| "syntax" that is technically not part of the language's
| syntax but actually is since you still need all those
| operations enough that they are included in the standard
| library and defining your own is frowned upon.
| kryptiskt wrote:
| Lisp has way more syntax than that; parenthesized lists
| don't cover any of the special forms. Knowing about
| application syntax doesn't
| help with understanding `let` syntax. Even worse, with
| macros, the amount of syntax is open-ended. That they all
| come in the form of S-expressions doesn't help a lot in
| learning them.
| skydhash wrote:
| Most languages' abstract machines expose a very simple API,
| it's up to the language to add useful constructs to help us
| write code more efficiently. Languages like Lisp start with
| a very simple syntax, then add those constructs with the
| language itself (even though those can be fixed using a
| standard), others just add it through the syntax. These
| constructs plus the abstract machine's operations form the
| semantics, syntax is however the language designer decided
| to present them.
| __MatrixMan__ wrote:
| Jury's out re: whether I feel this in my gut. Need more time
| with the lisps for that. But re: cognitive load maybe it goes
| like:
|
| 1. 1 language to rule them all, fancy syntax
|
| 2. Many languages, 1 simple syntax to rule them all
|
| 3. Many languages and many fancy syntaxes
|
| Here in the wreckage of the tower of babel, 1. isn't really
| on the table. But 2. might have benefits because the
| inhumanity of the syntax need only be confronted once. The
| cumulative cost of all the competing opinionated fancy
| syntaxes may be the worst option. Think of all the hours lost
| to tabs vs spaces or braces vs whitespace.
| dartos wrote:
| I think 3 is not only a natural state, but the best state.
|
| I don't think we can have 1 language that satisfies the
| needs of all people who write code, and thus, we can't have
| 1 syntax that does that either.
|
| 3 seems the only sensible solution to me, and we have it.
| __MatrixMan__ wrote:
| I dunno, here in 3 the hardest part of learning a
| language has little to do with the language itself and
| more to do with the ecosystem of tooling around that
| language. I think we could more easily get on to the
| business of using the right language for the job if more
| of that tooling were shared. If each language, for
| instance, did not have its own package manager, its own
| IDE, its own linters and language servers all with their
| own idiosyncrasies arising not from deep philosophical
| differences of the associated language but instead from
| accidental quirks of perspective from whoever decided
| that their favorite language needed a new widget.
|
| I admire the widget makers, especially those wrangling
| the gaps between languages. I just wish their work could
| be made easier.
| skydhash wrote:
| I really like the Linux package managers. If you're going
| to write an application that will run on some system,
| it's better to bake dependencies into it. And with
| virtualization and containerization, the system is not
| tied to a physical machine. I've been using containers
| (incus) more and more for real development purposes as I
| can use almost the same environment to deploy. I don't
| care much about the IDE, but I'm glad we have LSP, Tree-
| sitter, and DAP. The one thing I do not like is the
| proliferation of tooling version managers (NVM, ...) instead
| of managing the environment itself (tied to the project).
| andai wrote:
| This is interesting. My first thought was that a language
| where more meaning is expressed in syntax could catch more
| errors at compile time. But there seems to be no reason why
| meaning encoded in semantics could not also be caught at
| compile time.
|
| The main benefit of putting things in the syntax seems to be
| that many errors would become visually obvious.
| broken-kebab wrote:
| The problem with this statement is that it assumes parsing-
| easiness as something universal, and stable. And this is
| certainly not true. You may believe syntax A is so much
| easier simply because it's the syntax you have been dealing
| with most of your career thus your brain is trained for it.
| On top of it a particular task can make a lot of difference:
| most people would agree that regex is simplification versus
| writing the same logic in usual if-then way for pattern
| matching in strings, but I'm not sure many would like to have
| their whole programs looking that way (but even that could be
| subjective, see APL).
| James_K wrote:
| I've always thought these complaints are really just a
| reflection of how stuck we are in the C paradigm. The idea
| that you have to edit programs as text is outdated IMO. It
| should be that your editor operates on the syntax tree of the
| source code. Once you do that, the code can be displayed in
| any way.
| mdaniel wrote:
| I also believe this, and we're actually about half way
| there via MPS <https://github.com/JetBrains/MPS#readme> but
| I'm _pretty sure_ that dream is dead until this LLM hype
| blows over, since LLMs are not going to copy-paste syntax
| trees until the other dream of a universal representation
| materializes[1]
|
| 1: There have been _several_ attempts at Universal ASTs,
| including (unsurprisingly) a JVM-centric one from JetBrains
| https://github.com/JetBrains/intellij-community/blob/idea/24...
| nlitened wrote:
| I am surprised to hear that structural editing has been a
| hurdle for you, and I think I can offer a piece of advice. I
| also used to be terrified by its apparent complexity, but later
| found out that one just needs to use parinfer and to know key
| bindings for only three commands: slurp, barf, and raise.
|
| With just these four things you will be 95% there, enjoying the
| fruits of paredit without any complexity -- all the remaining
| tricks you can learn later when you feel like you're fluent.
| __MatrixMan__ wrote:
| Thanks very much for the advice, it's timely.
|
| <rant> It's not so much the editing itself but the
| unfamiliarity of the ecosystem. It seems it's a square peg
| and I've been crafting a round hole of habits for it:
|
| I guess I should use emacs? How to even configure it such
| that these actions are available? Or maybe I should write a
| plugin for helix so that I can be in a familiar environment.
| Oh, but the helix plugin language is a scheme, so I guess
| I'll use emacs until I can learn scheme better and then write
| that plugin. Oh but emacs keybinds are conflicting with what
| I've configured for zellij, maybe I can avoid conflicts by
| using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex
| seems like it aligns with my modal brain, oh but there goes
| another afternoon of fussing with emacs. Found and reported a
| symex "bug" but apparently it only appears in nix-governed
| environments so I guess I gotta figure out how to report the
| packaging bug (still todo). Also, I guess I might as well
| figure out how to get emacs to evaluate expressions based on
| which ones are selected, since that's one of the fun things
| you can do in lisps, but there's no plugin for the scheme
| that helix is using for its plugin language (which is why I'm
| learning scheme in the first place), but it turns out that AI
| is weirdly good at configuring emacs so now my emacs config
| contains most of what that plugin would entail. Ok, now I'm
| finally ready to learn scheme, I've got this big list of new
| actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy...
| Slurp, barf, and raise you say?
| excellent, I'll focus on those.
|
| I'm not actually trying to critique the unfamiliar space.
| These are all self inflicted wounds: me being persnickety
| about having it my way. It's just usually not so difficult to
| use something new and also have it my way.</rant>
| xenophonf wrote:
| I never bothered with structural editing on Emacs. I just
| use the sentence/paragraph movement commands. M-a, M-e,
| M-n, M-p, M-T, M-space, etc.
| nlitened wrote:
| To be fair, I am not a "lisper" and I don't know Emacs at
| all. I am just a Clojure enjoyer who uses IntelliJ +
| Cursive with its built-in parinfer/paredit.
| pxc wrote:
| > Oh but emacs keybinds are conflicting with what I've
| configured for zellij,
|
| Don't do that. ;)
|
| Emacs is a graphical application! Don't use it in the
| terminal unless you really have to (i.e., you're using it
| on a remote machine and TRAMP will not do).
|
| > it turns out that AI is weirdly good at configuring emacs
|
| I was just chatting with a friend about this. ChatGPT seems
| to be much better at writing ELisp than many other
| languages I've asked it to work with.
|
| Also while you're playing with it, you might be interested
| in checking out kakoune.el or meow, which provide modal
| editing in Emacs but with the selection-first ordering for
| commands, like in Kakoune and Helix rather than the old vi
| way.
|
| PS: symex looks really interesting! Hadn't seen that one
| before.
| cenamus wrote:
| Well, elisp probably accounts for like 85% of the lisp
| code on GH and co, so that'd make sense
| fanf2 wrote:
| Lisp has reader macros which allow you to reprogram its lexer.
| Lisp macros allow you to program the translation from the
| visible structure to the parse tree.
|
| For example, https://pyret.org/
|
| It really isn't simple or necessarily uniform.
| __MatrixMan__ wrote:
| I've heard that certain lisps (Common Lisp comes up when I
| search for reader macros) allow for all kinds of tinkering
| with themselves. But the ability of one to make itself not a
| lisp anymore, while interesting, doesn't seem to say much
| about the merits of sticking to s-expressions, except maybe
| to point out that somebody once decided not to.
| lispm wrote:
| Reader macros are there to program and configure the
| _reader_. The _reader_ is responsible for reading
| s-expressions into internal data structures. There are
| basically two main uses of reader-macros: data structures
| and reader control.
|
| A CL implementation will implement reading lists, symbols,
| numbers, arrays, strings, structures, characters,
| pathnames, ... via reader macros. Additionally the reader
| implements various forms of control operations: conditional
| reading, reading and evaluation, circular datastructures,
| quoting and comments.
|
| This is user programmable&configurable. Most uses will be
| in the two above categories: data structure syntax and
| control. For example we could add a syntax for hash tables
| to s-expressions. An example for a control extension would
| be to add support for named readtables. For example a
| Common Lisp implementation could add a readtable for
| reading s-expressions from Scheme, which has a slightly
| different syntax.
|
| Reader macros were optimized for implementing
| s-expressions, thus the mechanism isn't that convenient as
| a lexer/parser for actual programming languages. It's a
| bit painful to do so, but possible.
|
| A typical reader macro usage, beyond the usage described
| above, is one which implements a different token or
| expression syntax. For example there are reader macros
| which parse infix expressions. This might be useful in Lisp
| code where arithmetic expressions can be written in a more
| conventional infix syntax. The infix reader macro would
| convert infix expressions into prefix data.
| lispm wrote:
| Is Pyret based on _reader macros_? I would think it's much
| easier to use a syntax parser for that.
| kazinator wrote:
| I don't think it's easy to write a good syntax coloring engine
| like the one in Vim.
|
| Syntax coloring has to handle context: different rules for
| material nested in certain ways.
|
| Vim's syntax highlighter lets you declare two kinds of items:
| matches and regions. Matches are simpler lexical rules, whereas
| regions have separate expressions for matching the start and end
| and middle. There are ways to exclude leading and trailing
| material from a region.
|
| Matches and regions can declare that they are contained. In that
| case they are not active unless they occur in a containing
| region.
|
| Contained matches declare which regions contain them.
|
| Regions declare which other regions they contain.
|
| That's the basic semantic architecture; there are bells and
| whistles in the system due to situations that arise.
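The containment idea can be sketched in a few lines (a toy scanner, not Vim's actual engine): a `TODO` match is declared as contained, so it only fires while the comment region is open:

```python
def highlight(src):
    """Toy region/containment scanner: '/* .. */' is a region and
    'TODO' is a contained match, active only inside that region."""
    spans = []          # (group, start, end) like Vim's highlight items
    in_comment = False  # a one-level stand-in for the region stack
    start = 0
    i = 0
    while i < len(src):
        if not in_comment:
            if src.startswith("/*", i):    # region start pattern
                in_comment, start = True, i
                i += 2
            else:
                i += 1
        elif src.startswith("*/", i):      # region end pattern
            spans.append(("comment", start, i + 2))
            in_comment = False
            i += 2
        elif src.startswith("TODO", i):    # contained match
            spans.append(("todo", i, i + 4))
            i += 4
        else:
            i += 1
    return spans

# The TODO outside the comment region is deliberately not highlighted.
print(highlight("x = 1 /* TODO fix */ TODO"))
```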
|
| I don't think even Justine could develop that in an interview,
| other than as an overnight take home.
| kazinator wrote:
| Here is an example of something hard to handle: TXR language
| with embedded TXR Lisp.
|
| This is the "genman" script which takes the raw output of a
| manpage to HTML converter, and massages it to form the HTML
| version of the TXR manual:
|
| https://www.kylheku.com/cgit/txr/tree/genman.txr
|
| Everything that is white (not colored) is literal template
| material. Lisp code is embedded in directives, like @(do ...).
| In this scheme, TXR keywords appear purple, TXR Lisp ones
| green. They can be the same; see the (and ...) in line 149,
| versus numerous occurrences of @(and).
|
| Quasistrings contain nested syntax: see 130 where `<a href ..>
| ... </a>` contains an embedded (if ...). That could itself
| contain a quasistring with more embedded code.
|
| TXR's txr.vim and tl.vim syntax definition files are both
| generated by this:
|
| https://www.kylheku.com/cgit/txr/tree/genvim.txr
| saghm wrote:
| Naively, I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| iterate over the tokens and update the color of a token based
| on the type of node (and maybe just tracking a diff to avoid
| needing to recolor things that haven't changed). I'm guessing
| that if this isn't done, it's for efficiency reasons (e.g. due
| to requiring parsing the whole file to highlight rather than
| just the part currently visible on the screen)?
| Someone wrote:
| > I would have assumed that the "correct" way to write a
| syntax highlighter would be to parse into an AST and then
| [...] I'm guessing that if this isn't done, it's for
| efficiency reasons
|
| It's not only running time, but also ease of implementation.
|
| A good syntax highlighter should do a decent job highlighting
| both valid and invalid programs (rationale: in most (editor,
| language) pairs, writing a program involves going through
| moments where the program being written isn't a valid
| program)
|
| If you decide to use an AST, that means you need to have good
| heuristics for turning invalid programs into valid ones that
| best mimic what the programmer intended. That can be
| difficult to achieve (good compilers have such heuristics,
| but even if you have such a compiler, chances are it isn't
| possible to reuse them for syntax coloring)
|
| If this simpler approach gives you most of what you can get
| with the AST approach, why bother writing that?
|
| Also, there are languages where some programs can't be
| perfectly parsed or syntax colored without running them. For
| those, you need this approach.
| tomcam wrote:
| > I don't think even Justine could develop that in an interview
|
| Not so sure I'd put money on that opinion ;)
| susam wrote:
| > Every C programmers (sic) knows you can't embed a multi-line
| comment in a multi-line comment.
|
| And every Standard ML programmer might find this to be a
| surprising limitation. The following is a valid Standard ML
| program:
|         (* (* Nested (**) *) comment *)
|         val _ = print "hello, world\n"
|
| Here is the output:
|         $ sml < hello.sml
|         Standard ML of New Jersey (64-bit) v110.99.5
|         [built: Thu Mar 14 17:56:03 2024]
|         - = hello, world
|         $ mlton hello.sml && ./hello
|         hello, world
|
| Given how C was considered one of the "expressive" languages when
| it arrived, it's curious that nested comments were never part of
| the language.
| dahart wrote:
| There are 3 things I find funny about that comment: ML didn't
| have single-line comments, so same level of surprising
| limitation. I've never heard someone refer to C as
| "expressive", but maybe it was in 1972 when compared to
| assembly. And what bearing does the comment syntax have on the
| expressiveness of a language? I would argue absolutely none at
| all, by _definition_. :P
| susam wrote:
| > ML didn't have single-line comments, so same level of
| surprising limitation.
|
| It is not quite clear to me why the lack of single-line
| comments is such a surprising limitation. After all, a
| single-line block comment can easily serve as a substitute.
| However, there is no straightforward workaround for the lack
| of nested block comments.
|
| > I've never heard someone refer to C as "expressive", but
| maybe it was in 1972 when compared to assembly.
|
| I was thinking of Fortran in this context. For instance,
| Fortran 77 lacked function pointers and offered a limited set
| of control flow structures, along with cumbersome support for
| recursion. I know Fortran, with its native support for
| multidimensional arrays, excelled in numerical and scientific
| computing but C quickly became the preferred language for
| general purpose computing.
|
| While very few today would consider C a pinnacle of
| expressiveness, when I was learning C, the landscape of
| mainstream programming languages was much more restricted. In
| fact, the preface to the first edition of K&R notes the
| following:
|
| _" In our experience, C has proven to be a pleasant,
| expressive and versatile language for a wide variety of
| programs."_
|
| C, Pascal, etc. stood out as some of the few mainstream
| programming languages that offered a reasonable level of
| expressiveness. Of course, Lisp was exceptionally expressive
| in its own right, but it wasn't always the best fit for
| certain applications or environments.
|
| > And what bearing does the comment syntax have on the
| expressiveness of a language?
|
| Nothing at all. I agree. The expressiveness of C comes from
| its grammar, which the language parser handles. Support for
| nested comments, in the context of C, is a concern for the
| lexer, so indeed one does not directly influence the other.
| However, it is still curious that a language with such a
| sophisticated grammar and parser could not allocate a bit of
| its complexity budget to support nested comments in its
| lexer. This is a trivial matter, I know, but I still couldn't
| help but wonder about it.
| dahart wrote:
| Fair enough. From my perspective, lack of single line
| comments is a little surprising because most other
| languages had it at the time (1973, when ML was
| introduced). Lack of nested comments doesn't seem
| surprising, because it isn't an important feature for a
| language, and because most other languages did not have it
| at the time (1972, when C was introduced).
|
| I can imagine both pro and con arguments for supporting
| nested comments, but regardless of what I think, C
| certainly could have added support for nested comments at
| any time, and hasn't, which suggests that there isn't
| sufficient need for it. That might be the entire
| explanation: not even worth a little complexity.
| masfuerte wrote:
| AFAIK, C didn't get single line comments until C99. They
| were a C++ feature originally.
| dahart wrote:
| Oh wow, I didn't remember that, and I did start writing C
| before 99. I stand corrected. I guess that is a little
| surprising. ;)
|
| Is true that many languages had single line comments?
| Maybe I'm forgetting more, but I remember everything else
| having single line comments... asm, basic, shell. I used
| Pascal in the 80s and apparently forgot it didn't have
| line comments either?
| masfuerte wrote:
| That's my recollection, that most languages had single
| line comments. Some had multi-line comments but C++ is
| the first I remember having syntaxes for both. That said,
| I'm not terribly familiar with pre-80s stuff.
| quietbritishjim wrote:
| Some C compilers supported it as an unofficial extension
| well before C99, so that could be why you didn't realise
| or don't remember. I think that included both Visual
| Studio (which was really a C++ compiler that could turn
| off the C++ bits) and GCC with GNU extensions enabled.
| susam wrote:
| > C certainly could have added support for nested
| comments at any time
|
| After C89 was ratified, adding nested comments to C would
| have risked breaking existing code. For instance, this is
| a valid program in C89:
|         #include <stdio.h>
|         int main() {
|             /* /* Comment */
|             printf("hello */ world");
|             return 0;
|         }
|
| However, if a later C standard were to introduce nested
| comments, it would break the above program because then
| the following part of the program would be recognised as
| a comment:
|             /* /* Comment */
|             printf("hello */
|
| The above text would be ignored. Then the compiler would
| encounter the following:
|             world");
|
| This would lead to errors like _undeclared identifier
| 'world'_, _missing terminating " character_, etc.
| dahart wrote:
| Given the neighboring thread where I just learned that
| the lexer runs before the preprocessor, I'm not sure that
| would be the outcome. There's no reason to assume the
| comment terminator wouldn't be ignored in strings. And
| even today, you can safely write printf("hello //
| world\n"); without risking a compile error, right?
| susam wrote:
| > Given the neighboring thread where I just learned that
| the lexer runs before the preprocessor, I'm not sure that
| would be the outcome.
|
| That is precisely why nested comments would end up
| breaking the C89 code example I provided above. I
| elaborate on this further below.
|
| > There's no reason to assume the comment terminator
| wouldn't be ignored in strings.
|
| There is no notion of "comment terminator in strings" in
| C. At any point of time, the lexer is reading either a
| string or a comment but never one within the other. For
| example, in C89, C99, etc., this is an invalid C program
| too:
|             #include <stdio.h>
|             int main()
|             {
|                 /* Comment
|                 printf("hello */ world");
|                 return 0;
|             }
|
| In this case, we wouldn't say that the lexer is "honoring
| the comment terminator in a string" because, at the point
| the comment terminator '*/' is read, there is no active
| string. There is only a comment that looks like this:
| /* Comment printf("hello */
|
| The double quotation mark within the comment is
| immaterial. It is simply part of the comment. Once the
| lexer has read the opening '/*', it looks for the
| terminating '*/'. This behaviour would hold even if
| future C standards were to allow nested comments, which
| is why nested comments would break the C89 example I
| mentioned in my earlier HN comment.
|
| > And even today, you can safely write printf("hello //
| world\n"); without risking a compile error, right?
|
| Right. But it is not clear what this has got to do with
| my concern that nested comments would break valid C89
| programs. In this printf() example, we only have an
| ordinary string, so obviously this compiles fine. Once
| the lexer has read the opening quotation mark as the
| beginning of a string, it looks for an unescaped
| terminating quotation mark. So clearly, everything until
| the unescaped terminating quotation mark is a string!
| pklausler wrote:
| > Fortran 77 lacked function pointers
|
| But we did have dummy procedures, which covered one of the
| important use cases directly, and which could be abused to
| fake function/subroutine pointers stored in data.
| michaelcampbell wrote:
| I was barely too young for this to make much of an impact
| at the time (though I'm older than many, perhaps most,
| here). I understand why C was considered a "high level
| language", but it still hits me weird, given today's
| context.
| gsliepen wrote:
| Well there is one way to nest comments in C, and that's by
| using #if 0:
|             #if 0
|             This is a
|             #if 0
|             nested comment!
|             #endif
|             #endif
| fanf2 wrote:
| Except that text inside #if 0 still has to lex correctly.
|
| (unifdef has some evil code to support using C-style
| preprocessor directives with non-C source, which mostly boils
| down to ignoring comments. I don't recommend it!)
| dahart wrote:
| > Except that text inside #if 0 still has to lex correctly.
|
| Are you sure? I just tried on godbolt and that's not true
| with gcc 14.2. I've definitely put syntax errors
| intentionally into #if 0 blocks and had it compile. Are you
| thinking of some older version or something? I thought the
| pre-processor ran before the lexer since always...
| fanf2 wrote:
| There are three (relevant) phases (see "translation
| phases" in section 5 of the standard):
|
| * program is lexed into preprocessing tokens; comments
| turn into whitespace
|
| * preprocessor does its thing
|
| * preprocessor tokens are turned into proper tokens;
| different kinds of number are disambiguated; keywords and
| identifiers are disambiguated
|
| If you put an unclosed comment inside #if 0 then it won't
| work as you might expect.
| dahart wrote:
| Ah, I see. You're right!
| kragen wrote:
| This is not just true of Standard ML; it's also true of regular
| ML.
| layer8 wrote:
| Lexing nested comments requires maintaining a stack (or at
| least a nesting-level counter). That wasn't traditionally seen
| as being within the realm of lexical analysis, which would only
| use a finite-state automaton, like regular expressions.
| akira2501 wrote:
| Pascal always supported the same nested comment syntax as your
| example.
| lupire wrote:
| > You'll notice its hash function only needs to consider a single
| character in a string. That's what makes it perfect,
|
| Is that a joke?
|
| https://en.m.wikipedia.org/wiki/Perfect_hash_function
| jaen wrote:
| No. Taking the value of a single character is a correct perfect
| hash function, assuming there exists a position for the input
| string set where all characters differ.
| playingalong wrote:
| Nice read.
|
| I guess the article could be called Falsehoods Programmers Assume
| of Programming Language Syntaxes.
| TomatoCo wrote:
| I think my favorite C trigraph was something like
| do_action() ??!??! handle_error()
|
| It almost looks like special error handling syntax but still
| remains satisfying once you realize it's an || logical-or
| expression using short-circuiting rules: handle_error() only
| runs if the action returns zero (falsy).
| wslh wrote:
| Did you choose the legacy C trigraphs over || for aesthetic
| purposes?
| wslh wrote:
| Could you review my comment on HN? Please educate me if there
| is something I haven't understood, rather than downvoting my
| question.
| samatman wrote:
| The grandparent post is specifically about trigraphs.
| Saying something about trigraphs was the end-in-itself,
| trigraphs were chosen to illustrate something about
| trigraphs. So your question made no sense. Hope that helps.
| Izkata wrote:
| Maybe the confusion was the other way, more like "why is
| that funny/interesting?"
|
| An attempt to answer that: In English, mixing ?! at the
| end of a question is a way of indicating bewilderment.
| Like "What was that?!"
| wslh wrote:
| My question was precisely about why the user like
| trigraphs over using just || on this case. It is a very
| clear question and makes all the sense.
| teo_zero wrote:
| I didn't downvote your comment but understand why it
| looks "wrong": it's like, in a thread on English
| oddities, you replied to someone bringing up the "buffalo
| buffalo buffalo" example with the question "why are you
| so fond of bovines"?
| wslh wrote:
| It has nothing to do with that. I could ask why he didn't
| choose a different homonymic ambiguity [1].
|
| [1] https://journals.linguisticsociety.org/proceedings/in
| dex.php...
| kergonath wrote:
| The post shows a "favorite C trigraph" thing, not that
| they were going out of their way to use trigraphs in
| actual code or that you should. Using trigraphs is the
| whole premise so no, your question makes no sense in that
| context.
|
| FWIW the ??!??! double trigraph as error processing is
| funny because of the meaning of ?! and various
| combinations of ? and !. It is funny and it has
| trigraphs. That's the whole point.
| mstade wrote:
| My reading of the downvoted question was one of genuine
| curiosity of why the author chose that as a favorite
| trigraph, as in "why that one instead of another", not as
| criticism of the choice of trigraph over something more
| conventional. I may be wrong of course, but it didn't
| seem like a particularly malicious question to me and
| your rationale unfortunately doesn't convince me
| otherwise. Not that it has to, this is all very
| subjective after all, but just offering up a counter
| opinion.
|
| I gave the question a +1 because I, as previously stated,
| read it to be genuine curiosity. Maybe a smiley would've
| helped, I don't know. ¯\_(ツ)_/¯
| wslh wrote:
| > The post shows a "favorite C trigraph" thing, not that
| they were going out of their way to use trigraphs in
| actual code or that you should.
|
| But I am free to be curious and ask the author why he
| chose it! We are not computers but human beings! There is
| no HN rule that says I cannot be curious and ask a
| question that arose from a thread even if it is not
| directly connected to it! [1].
|
| [1] https://www.iflscience.com/charles-babbage-once-sent-
| the-mos...
| JadeNB wrote:
| > So your question made no sense. Hope that helps.
|
| I think that is uncharitable. The question ("Did you
| choose the legacy C trigraphs over || for aesthetic
| purposes?") makes perfect sense to me. I think context
| makes it reasonably clear that the answer is 'yes,' but
| that doesn't mean that the question doesn't make sense,
| only perhaps that it didn't need to be asked.
| James_K wrote:
| Easiest way to get downvotes is to ask people not to give
| them. You just gotta ignore the haters.
| jacobn wrote:
| https://en.wikipedia.org/wiki/Digraphs_and_trigraphs_(progra...
|
| ??! is converted to |
|
| So ??!??! becomes ||, i.e. "or"
| IshKebab wrote:
| I don't understand why you wouldn't use Tree Sitter's syntax
| highlighting for this. I mean it's not going to be as fast but
| that clearly isn't an issue here.
|
| Is this a "no third party dependencies" thing?
| jart wrote:
| I don't want to require everyone who builds llamafile from
| source need to install rust. I don't even require that people
| install the gperf command, since I can build gperf as a 700kb
| actually portable executable and vendor it in the repo. Tree
| sitter I'd imagine does a really great highly precise job with
| the languages it supports. However it appears to support fewer
| of them than I am currently. I'm taking a breadth first
| approach to syntax highlighting, due to the enormity of
| languages LLMs understand.
| IshKebab wrote:
| I think the Rust component of tree-sitter-highlight is
| actually pretty small (Tree Sitter generates C for the actual
| parser).
|
| But fair enough - fewer dependencies is always nice,
| especially in C++ (which doesn't have a modern package
| manager) and in ML where an enormous janky Python
| installation is apparently a perfectly normal thing to
| require.
| mdaniel wrote:
| I somehow thought Conan[1] was the C++ package manager;
| it's at least partially supported by GitLab, for what
| that's worth
|
| 1: https://docs.conan.io/2/introduction.html
| IshKebab wrote:
| No, if anything vcpkg is "the C++ package manager", but
| it's nowhere near pervasive and easy-to-use enough to
| come close to even Pip. It's leagues away from Cargo, Go,
| and other _actually good_ PL package managers.
| mdaniel wrote:
| I knew that Microsoft used that on Windows but had no
| idea it was multi-platform: https://github.com/microsoft/
| vcpkg/releases/tag/2024.10.21 _(MIT, like a lot of their
| stuff)_
|
| Microsoft is such an odd duck, sometimes, but I'm glad to
| take advantage of their "good years" while it lasts
| chubot wrote:
| Have you developed against TreeSitter? Some feedback from
| people who use it here -
| https://news.ycombinator.com/item?id=39783471
|
| And here -
| https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing...
| IshKebab wrote:
| Yes I have, and it worked very well for what I was using it
| for (assembly language LSP server). I didn't run into any of
| the issues they mentioned (not saying they don't exist
| though).
|
| For new projects I use Chumsky. It's a pure Rust parser which
| is nice because it means you avoid the generated C, and it
| also gives you a fully parsed and natively typed output,
| rather than Tree Sitter's dynamically typed tree of nodes,
| which means there's no extra parsing step to do.
|
| The main downside is it's more complicated to write the
| parser (some fairly extreme types). The API isn't stable yet
| either. But overall I like it more than Tree Sitter.
| jim_lawless wrote:
| Forth has a default syntax, but Forth code can execute during the
| compilation process allowing it to accept/compile custom
| syntaxes.
| SonOfLilit wrote:
| Justine gets very close to the hairiest parsing issue in any
| language without encountering it:
|
| Perl's syntax is undecidable, because the difference between
| treating some characters as a comment or as a regex can depend on
| the type of a variable that is only determined e.g. based on
| whether a search for a Collatz counterexample terminates, or
| just, you know, user input.
|
| https://perlmonks.org/?node_id=663393
|
| C++ templates have a similar issue, I think.
| fanf2 wrote:
| I think possibly the most hilariously complicated instance of
| this is in perl's tokenizer, toke.c (which starts with a
| Tolkien quote, 'It all comes from here, the stench and the
| peril.' -- Frodo).
|
| There's a function called intuit_more which works out if
| $var[stuff] inside a regex is a variable interpolation followed
| by a character class, or an array element interpolation. Its
| result can depend on whether something in the stuff has been
| declared as a variable or not.
|
| But even if you ignore the undecidability, the rest is still
| ridiculously complicated.
|
| https://github.com/Perl/perl5/blob/blead/toke.c#L4502
| ufo wrote:
| Wow. I wonder how that function came to be in the first
| place. Surely it couldn't have started out that complicated?
| swolchok wrote:
| > C++ templates have a similar issue
|
| TIL! I went and dug up a citation:
| https://blog.reverberate.org/2013/08/parsing-c-is-literally-...
| layer8 wrote:
| How could a search for a Collatz counterexample possibly
| terminate? ;)
| chubot wrote:
| Yup, bash and GNU Make have the same issue as Perl does, and I
| mention the C++ issue here too:
|
| _Parsing Bash is Undecidable_ -
| https://www.oilshell.org/blog/2016/10/20.html
|
| I remember a talk from Larry Wall on Perl 6 (now Raku), where
| he says this type of thing is a mistake. Raku can be statically
| parsed, as far as I know.
| jwilk wrote:
| Parsing POSIX shell is undecidable too:
|
| https://news.ycombinator.com/item?id=30362718
| chubot wrote:
| Yes, good point -- aliases makes parse time depend on
| runtime. That is mentioned in
|
| _Morbig: A static parser for POSIX shell_ - https://schola
| r.google.com/scholar?cluster=15754961728999604...
|
| (at the time I wrote the post about bash, I hadn't
| implemented aliases yet)
|
| But it's a little different since it is an intentional
| feature, not an accident. It's designed to literally
| reinvoke the parser at runtime. I think it's not that
| good/useful a feature, and I tend to avoid it, but many
| people use it.
| petesergeant wrote:
| > Perl also has this goofy convention for writing man pages in
| your source code
|
| The world corpus of software would be much better documented if
| everywhere else had stolen this from Perl. Inline POD is great.
| kragen wrote:
| Perl and Python stole it from Emacs Lisp, though Perl took it
| further. I'm not sure where Java stole it from, but nowadays
| Doxygen is pretty common for C code. Unfortunately this results
| in people thinking that Javadoc and Doxygen are substitutes for
| actual documentation like the Emacs Lisp Reference Manual,
| which cannot be generated from docstrings, because the
| organization of the source code is hopelessly inadequate for a
| reference manual.
| mdaniel wrote:
| > Emacs Lisp Reference Manual, which cannot be generated from
| docstrings, because the organization of the source code is
| hopelessly inadequate for a reference manual.
|
| Well, they're not doing themselves any favors by just willy
| nilly mixing C with "user-facing" defuns <https://emba.gnu.or
| g/emacs/emacs/-/blob/ed1d691184df4b50da6b...>. I was curious
| if they could benefit from "literate programming" since
| OrgMode is _the bee's knees_ but not with that style coding
| they can't
| kragen wrote:
| I didn't mean that specifically the Emacs source code was
| not organized in the right way for a reference manual. I
| meant that C and Java source code in general isn't. And
| C++, which is actually where people use Doxygen more.
|
| The Python standard library manual is also exemplary, and
| also necessarily organized differently from the source
| code.
| mdaniel wrote:
| > The Python standard library manual is also exemplary
|
| Maybe parts of it are, but as a concrete example
| https://docs.python.org/3/library/re.html#re.match is
| just some YOLO about what, _specifically_ , is the first
| argument to re.match: string, or compiled expression?
| Well, it's both! Huzzah! I guess they get points for
| consistency because the first argument to re.compile is
| also "both"
|
| But, any idea what type re.compile returns? cause
| https://docs.python.org/3/library/re.html#re.compile is
| all "don't you worry about it" versus its re.match friend
| who goes out of their way to state that it is an re.Match
| object
|
| Would it have been so hard to actually state it, versus
| requiring someone to invoke type() to get <class
| 're.Pattern'>?
| kragen wrote:
| I'm surprised to see that it's allowed to pass a compiled
| expression to re.match, since the regular expression
| object has a .match method of its own. To me the fact
| that the argument is called _pattern_ implies that it's
| a string, because at the beginning of that chapter, it
| says, "Both patterns and strings to be searched can be
| Unicode strings ( _str_ ) as well as 8-bit strings (
| _bytes_ ). (...) Usually patterns will be expressed in
| Python code using this raw string notation."
|
| But this ability to pass a compiled regexp rather than a
| string can't have been an accidental feature, so I don't
| know why it isn't documented.
|
| Probably it would be good to have an example of invoking
| re.match with a literal string in the documentation item
| for re.match that you linked. There are sixteen such
| examples in the chapter, the first being re.match(r"(\w+)
| (\w+)", "Isaac Newton, physicist"), so you aren't going
| to be able to read much of the chapter without figuring
| out that you can pass a string there, but all sixteen of
| them come after that section. A useful example might be:
| >>> [s for s in ["", " ", "a ", " a", "aa"] if
| re.match(r'\w', s)] ['a ', 'aa']
|
| It's easy to make manuals worse by adding too much text
| to them, but in this case I think a small example like
| that would be an improvement.
|
| As for what type re.compile returns, the section you
| linked to says, "Compile a regular expression pattern
| into a regular expression object, which can be used for
| matching using its match(), search() and other methods,
| described below." Is your criticism that it doesn't
| explicitly say that the regular expression object is
| _returned_ (as opposed to, I suppose, stored in a table
| somewhere), or that it says "a regular expression
| object" instead of saying "an re.Pattern object"? Because
| the words "regular expression object" are a link to the
| "Regular Expression Objects" section, which begins by
| saying, "class re.Pattern: Compiled regular expression
| object returned by re.compile()." To me the name of the
| class doesn't seem like it adds much value here--to write
| programs that work using the re module, you don't need to
| know the name of the class the regular expression objects
| belong to, just what interface they support.
|
| (It's unfortunate that the class name is documented,
| because it would be better to rename it to a term that
| wasn't already defined to mean "a string that can be
| compiled to a regular expression object"!)
|
| But possibly I've been using the re module long enough
| that I'm blind to the deficiencies in its documentation?
|
| Anyway, I think documentation questions like this, about
| gradual introduction, forward references, sequencing,
| publicly documented (and thus stable) versus internal-
| only names, etc., are hard to reconcile with the
| constraints of source code, impossible in most languages.
| In this case the source code is divided between Python
| and C, adding difficulty.
| metadat wrote:
| _> The languages I decided to support are Ada, Assembly, BASIC,
| C, C#, C++, COBOL, CSS, D, FORTH, FORTRAN, Go, Haskell, HTML,
| Java, JavaScript, Julia, JSON, Kotlin, ld, LISP, Lua, m4, Make,
| Markdown, MATLAB, Pascal, Perl, PHP, Python, R, Ruby, Rust,
| Scala, Shell, SQL, Swift, Tcl, TeX, TXT, TypeScript, and Zig._
|
| A few (admittedly silly) questions about the list:
|
| 1. Why no Erlang, Elixir, or Crystal?
|
| Erlang appears to be just at the author's boundary at #47 on the
| TIOBE index. https://www.tiobe.com/tiobe-index/
|
| 2. What is _"Shell"_? Sh, Bash, Zsh, Windows Cmd, PowerShell..?
|
| 3. Perl but no Awk? Curious why, because Awk is a similar but
| comparatively trivial language. Widely used, too.
|
| To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet
| m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.
|
| What's the methodology / criteria? Only things the author is
| already familiar with?
|
| Still a fun article.
| Yasuraka wrote:
| Tiobes's index is quite literally worthless, especially with
| regards to its stated purpose, let alone as a general point of
| orientation.
|
| I wish that people would stop lending it any credibility.
| Kwpolska wrote:
| "Shell" in the context of a syntax highlighting language picker
| almost always means a Unixy shell, most likely something along
| the lines of Bash.
| dakiol wrote:
| Wouldn't it be possible to let the LLM do the highlighting? Instead
| of returning code in plain text, it could return code within html
| with the appropriate tags. Maybe it's harder than it sounds...
| but if it's just for highlighting the code the LLM returns, I
| wouldn't mind the highlighting not being 100% accurate.
| trashburger wrote:
| Would be much slower and eat up precious context window.
| layer8 wrote:
| The author may have missed that lexing C is actually context-
| sensitive, i.e. you need a symbol table:
| https://en.wikipedia.org/wiki/Lexer_hack
|
| Of course, for syntax highlighting this is only relevant if you
| want to highlight the multiplication operator differently from
| the dereferencing operator, or declarations differently from
| expressions.
|
| More generally, however, I find it useful to highlight (say)
| types differently from variables or functions, which in some
| (most?) popular languages requires full parsing and symbol table
| information. Some IDEs therefore implement two levels of syntax
| highlighting, a basic one that only requires lexical information,
| and an extended one that kicks in when full grammar and type
| information becomes available.
| legobmw99 wrote:
| I'd be shocked if jart didn't know this, but it seems unlikely
| that an LLM would generate one of these most vexing parses,
| unless explicitly asked
| layer8 wrote:
| Given all the things that were new to the author in the
| article, I wouldn't be shocked at all. There's just a huge
| number of things to know, or to have come across.
| jraph wrote:
| Justine is proficient in C, she is the author of a libc
| (cosmopolitan) among other things, like Actually Portable
| Executables [1].
|
| I would expect her to know C quite well, and that's
| probably an understatement.
|
| [1] https://justine.lol/ape.html
| quietbritishjim wrote:
| I think you're thinking of something different to the issue
| in the parent comment. The most vexing parse is, as the name
| suggests, a problem at the parsing stage rather than the
| earlier lexing phase. Unlike the referenced lexing problem,
| it doesn't require any hack for compilers to deal with it.
| That's because it's not really a problem for the compiler;
| it's humans that find it surprising.
| alekratz wrote:
| I don't think the lexer hack is relevant in this instance. The
| lexer hack just refers to the ambiguity of `A * B` and whether
| that should be parsed as a variable declaration or an
| expression. If you're building a syntax tree, then this
| matters, but AFAICT all the author needs is a sequence of
| tokens and not a syntax tree. Maybe "parser hack" would be a
| better name for it.
| teo_zero wrote:
| > this is only relevant if you want to highlight the
| multiplication operator differently from the dereferencing
| operator
|
| Can you mention one editor which does that?
| quietbritishjim wrote:
| I don't think they implied there is. The sentence you quoted
| is essentially "this is relevant for their article about
| weird lexical syntax, but (almost definitely) not relevant to
| their original problem of syntax highlighting".
| mdaniel wrote:
| I could be stretching the definition of "does" but the
| newfound(?) tree-sitter support in Emacs[1] I believe would
| allow that since it for sure understands the distinction but
| I don't possess enough font-lock ninjary to actually, for
| real, bind a different color to the distinct usages
|             /* given foo.c */
|             int main() {
|                 int a, *b;
|                 a = 5 * 10;
|                 b = &a;
|                 printf("a is %d\n", *b);
|             }
|
| and then M-x c-ts-mode followed by navigating to each * and
| invoking M-x treesit-inspect-node-at-point in turn produces,
| respectively:
|             (declaration declarator: (pointer_declarator "*"))
|             right: (binary_expression operator: "*")
|             arguments: (argument_list (pointer_expression operator: "*"))
|
| 1: https://www.emacswiki.org/emacs/Tree-sitter
| teo_zero wrote:
| These examples are unambiguous. Try with something more
| spicy like:
|             return (A)*(B);
|
| which depends on A being a type or a variable.
| dummy7777 wrote:
| hey
| murkt wrote:
| Author hasn't tried to highlight TeX. Which is good for their
| mental health, I suppose, as it's generally impossible to fully
| highlight TeX without interpreting it.
|
| Even parsing is not enough, as it's possible to redefine what
| each character does. You can make it do things like "and now K
| means { and C means }".
|
| Yes, you can find papers on arXiv that use this god-forsaken
| feature.
| jart wrote:
| I wrote https://github.com/Mozilla-
| Ocho/llamafile/blob/main/llamafil... and it does a reasonable
| job highlighting without breaking for all the .tex files I
| could find on my hard drive. My goal is to hopefully cover
| 99.9% of real world usage, since that'll likely cover
| everything an LLM might output. Esoteric syntax also usually
| isn't a problem, so long as it doesn't cause strings and
| comments to extend forever, eclipsing the rest of the source
| code in a file.
| murkt wrote:
| Yes, when the goal isn't to support 100% of all the weird stuff,
| then it's orders of magnitude easier!
| nathell wrote:
| Same with Common Lisp (you can redefine the read table),
| although that's likely abused less often on arXiv.
| bobbylarrybobby wrote:
| I couldn't believe it when I learned that \makeatletter does
| not "make (something) at a letter (character)" but rather
| "treats the '@' character as a letter when parsing".
| xonix wrote:
| No AWK?
| sundarurfriend wrote:
| The final line number count is missing Julia. Based on the file
| in the repo, it would be at the bottom of the first column:
| between ld and R.
|
| Among the niceties listed here, the one I'd wish for Julia to
| have would be C#'s "However many quotes you put on the lefthand
| side, that's what'll be used to terminate the string at the other
| end". Documentation that talks about quoting would be so much
| easier to read (in source form) with something like that.
| sundarurfriend wrote:
| One nicety that Julia does have that I didn't know about (or
| had forgotten) is nested multi-line comments.
| #= this one has a #= nested comment =#
| inside of it and that works fine! =#
| nusaru wrote:
| > Ruby is the union of all earlier languages, and it's not even
| formally documented.
|
| It's documented, but you need $250 to spare:
| https://www.iso.org/standard/59579.html
| mdaniel wrote:
| Well, according to (ahem) _a copy_ that I found, it only goes
| up to MRI 1.9 and goes out of its way to say "welp, the world
| is changing, so we're just going to punt until Ruby stabilizes"
| which is damn cheating for a _standard_ IMHO
|
| Also, while doing some digging I found there actually are a
| number of the standards that are legitimately publicly
| available
| https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
| vidarh wrote:
| ISO Ruby is a tiny, dated subset of Ruby. I doubt you'll find
| much Ruby that conforms to it.
|
| The Ruby everyone uses is much better defined by RubySpec etc.
| via test cases, but that's not complete either.
| tomcam wrote:
| > If you ever want to confuse your coworkers, then one great way
| to abuse this syntax is by replacing the heredoc marker with an
| empty string
|
| Maybe I am in favor of the death penalty after all
| petters wrote:
| > I'm not sure who wants to be able to syntax highlight C at 35
| MB per second, but I am now able to do so
|
| Fast, but tcc *compiles* C to binary code at 29 MB/s on a really
| old computer: https://bellard.org/tcc/#speed Should be possible
| to go much faster but probably not needed
| BiteCode_dev wrote:
| Justine vs Bellard, that's a nice setup.
| transfire wrote:
| Impressive work!
|
| I am surprised Smalltalk and Prolog are in there though.
| fahrnfahrnfahrn wrote:
| While developing the syntax for a programming language in the
| early 80s, I discovered that allowing spaces in identifiers was
| unambiguous, e.g., upper left corner = scale factor * old upper
| left corner.
___________________________________________________________________
(page generated 2024-11-03 23:01 UTC)