[HN Gopher] Emacs: Feature/tree-sitter merged into master
       ___________________________________________________________________
        
       Emacs: Feature/tree-sitter merged into master
        
       Author : signa11
       Score  : 220 points
       Date   : 2022-11-23 04:46 UTC (18 hours ago)
        
 (HTM) web link (lists.gnu.org)
 (TXT) w3m dump (lists.gnu.org)
        
       | Wonnk13 wrote:
       | So if I have a fairly unremarkable setup with LSP to give me
       | completions, what do I get by fooling around with tree-sitter. It
       | seems like this is more geared toward building an AST, so I'm not
       | sure how it would present itself to the end user currently?
        
         | omnicognate wrote:
         | Faster and more correct syntax highlighting is the main benefit
         | atm, as I understand it.
         | 
         | In general it's for functionality that needs to understand
         | syntax but doesn't need a full compilation-level understanding
         | of the code, that can benefit from much faster responses than
         | an LSP server can provide and that people may want working out
         | of the box for many languages without having to install and
         | configure language servers, generate compilation DBs, etc.
        
         | bergheim wrote:
         | You are correct. tree-sitter is not in competition with lsp;
         | lsp is project wide (different files), so will do say code
         | completion. tree-sitter is analyzing the current buffer and
         | applying things like highlighting, brackets etc.
         | 
         | lsp can do some of these things as well, but sending the entire
         | buffer over to the lsp server every time you want to update the
         | buffer is an expensive operation. tree-sitter does it locally.
        
         | josteink wrote:
         | I think at this point it's a new building primitive mostly
         | aimed at major-mode authors.
         | 
         | That said, tree-sitter should make it possible to create
         | paredit-like implementations for languages not LISP and other
         | stuff like that, which IMO could turn out to be really neat.
         | 
         | As a change, this is quite significant, but not directly aimed
         | at end-users.
        
       | arc-in-space wrote:
       | Interesting. Current syntax highlighting in emacs is mostly fine,
       | except for how it occasionally blows up - an unterminated quote,
       | in some languages, can run out and match the entire tail as a
       | string, potentially freezing on a large file. Paredit avoids
       | this by not even letting you do that unless you ask very nicely.
       | I wonder if tree-sitter helps there.
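
The failure mode described above can be sketched in a few lines of Python. This is only an illustration: `NAIVE_STRING`, `naive_string_spans` and `bounded_string_spans` are hypothetical names, and the "cut the string off at the end of the line" rule is an assumption standing in for whatever recovery a real grammar performs. It is not a description of how font-lock or tree-sitter is implemented.

```python
import re

# Naive highlighter rule: a string starts at a quote and runs to the next
# quote, or to the end of the buffer if no closing quote exists.
NAIVE_STRING = re.compile(r'"[^"]*("|\Z)')


def naive_string_spans(text):
    """Return (start, end) spans the naive rule would colour as strings."""
    return [m.span() for m in NAIVE_STRING.finditer(text)]


def bounded_string_spans(text):
    """Return (start, end, closed) spans; an unterminated string is cut off
    at the end of its line and flagged instead of swallowing the buffer."""
    spans = []
    i = 0
    while i < len(text):
        if text[i] == '"':
            j = i + 1
            # Stop at the closing quote or at the end of the line.
            while j < len(text) and text[j] not in '"\n':
                j += 1
            closed = j < len(text) and text[j] == '"'
            end = j + 1 if closed else j
            spans.append((i, end, closed))
            i = end
        else:
            i += 1
    return spans


# One unterminated quote on the first line of a three-line buffer.
buffer = 'x = "oops\ny = 1\nz = 2\n'
print(naive_string_spans(buffer))    # the span runs to the end of the buffer
print(bounded_string_spans(buffer))  # the span stops at the end of line 1
```

With the naive rule the whole tail of the buffer is coloured as a string; with the bounded rule only the rest of the first line is affected.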
        
         | clircle wrote:
         | Maybe you and I are using different major modes, but I see the
         | warts of regex highlighting in Emacs. It's quite bad in AUCTeX.
        
       | aidenn0 wrote:
       | Incremental parsing of incorrect code is one of those things that
       | is literally impossible in the general case, but tree-sitter has
       | found a lot of good ways to do it that are not just possible for
       | a large fraction of reality, but also _performant_. It's hard to
       | overstate how impressive a piece of engineering this is.
        
       | Ian_Macharia wrote:
       | I use tree-sitter in neovim and the syntax highlighting is on par
       | with VSCode
        
       | mickeyp wrote:
       | If you're wondering what Tree-Sitter is and why Emacs would want
       | it, I wrote about it a while ago:
       | 
       | https://www.masteringemacs.org/article/tree-sitter-complicat...
        
         | afry1 wrote:
         | That point you make about syntax highlighting being slow while
         | using eglot/LSP-mode is a great one. I've been a bit
         | underwhelmed with eglot, and I think that must be the reason:
         | it feels like I'm programming in a bowl of oatmeal with every
         | keystroke.
         | 
         | Do you have any tips or guides for using treesitter for syntax
         | highlighting/structural editing and eglot/LSP-mode for
         | everything else?
        
           | omnicognate wrote:
           | AFAIK eglot/lsp-mode don't do syntax highlighting. The
           | article's just explaining why that is (i.e. because it would
           | be too slow).
           | 
           | If you don't have tree-sitter your syntax highlighting will
           | be done by the regex based font-lock-mode. I don't think
           | eglot/lsp-mode make that slower, and I believe tree-sitter
           | should speed it up (and make it more correct) without
           | affecting them. I haven't tried it yet, though.
        
             | TeMPOraL wrote:
             | It must be a matter of configuration. At work, I use buffer
             | re-fontification as an indicator that clangd correctly
             | processed the C++ source file I just opened. That's with an
             | Emacs built from source ~half a year ago + LSP mode.
        
               | omnicognate wrote:
               | Oh, interesting, it does appear lsp-mode now does
               | "semantic highlighting" if the server supports it. I
               | switched to eglot a while back (before it was added),
               | which doesn't.
               | 
               | I don't think I'd want that. Syntax highlighting and
               | indentation are things I want instant feedback from.
               | 
               | That affects the answer to the question. I assume you'd
               | need to persuade lsp-mode not to do this and leave it to
               | tree-sitter, but I don't know how to do that.
        
               | TeMPOraL wrote:
               | Last I checked, "semantic tokens" were opt-in; I think
               | it's still the case:
               | 
               | https://emacs-lsp.github.io/lsp-mode/page/settings/semantic-...
        
         | sph wrote:
         | There is not an Emacs topic I'd like to know more about that
         | you haven't already covered on your website.
         | 
         | Thanks, your articles and your book are the best guides into
         | the world of Emacs.
        
           | mickeyp wrote:
           | Thank you :) I'm glad you like my site and my book!
        
       | davidkunz wrote:
       | Congratulations, Emacs! I hope it will be a similar success story
       | as in Neovim. If more systems use it, the question "should my
       | programming language provide a Tree-Sitter parser" becomes a
       | no-brainer.
        
       | signa11 wrote:
       | For those unfamiliar with it, tree-sitter
       | (https://emacs-tree-sitter.github.io/) aims to be a foundational
       | package that understands code structurally (think abstract syntax
       | trees). This was done earlier via regexes, which have their
       | limitations.
       | 
       | This talk by the author is quite instructive as well:
       | https://www.thestrangeloop.com/2018/tree-sitter---a-new-pars...
        
         | tmalsburg2 wrote:
         | Is there a chance that this is going to make the parsing of
         | large org mode files faster?
        
           | AlanYx wrote:
           | It's not related to tree-sitter, but recent work on using
           | text properties instead of overlays for folded regions in org
           | has improved performance opening org files with folded
           | regions from O(n^2) to O(nlogn). See
           | https://blog.tecosaur.com/tmio/2022-05-31-folding.html It's a
           | big improvement in practice.
        
         | robenkleene wrote:
         | One thing I'll add, because I think it's an interesting insight
         | about the priorities of code parsing for text editors: Tree-
         | sitter is specifically designed to be very effective at parsing
         | code that's in an invalid state. E.g., think about adding a new
         | line to a program, the new line you're adding is typically
         | invalid for the majority of time until you've finished typing
         | it out.
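
A toy version of "keep parsing around the broken line" might look like the following Python sketch. It assumes a made-up language where every line is `name = number`, and it recovers by simply treating each line independently; `Node`, `ASSIGN` and `parse` are hypothetical names. Tree-sitter's actual recovery algorithm is far more general than this. The only thing borrowed from it is the idea of recording an `ERROR` node and carrying on.

```python
import re
from dataclasses import dataclass

# A complete statement in the toy language: identifier, '=', integer.
ASSIGN = re.compile(r'\s*([A-Za-z_]\w*)\s*=\s*(\d+)\s*$')


@dataclass
class Node:
    kind: str  # "assign" for a valid line, "ERROR" for anything else
    line: int  # 1-based line number in the source
    text: str  # the raw text of the line


def parse(source):
    """Parse line by line; an invalid line becomes an ERROR node instead of
    aborting, so the lines after it still get proper nodes."""
    nodes = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        kind = "assign" if ASSIGN.match(line) else "ERROR"
        nodes.append(Node(kind, lineno, line))
    return nodes


# The second line is still being typed: there is no value after '=' yet.
half_typed = "a = 1\nb = \nc = 3\n"
for node in parse(half_typed):
    print(node.line, node.kind, repr(node.text))
```

The half-typed line is marked as an error while the lines before and after it keep their normal structure, which is what an editor needs to keep highlighting and navigation stable during typing.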
        
         | josteink wrote:
         | This HN post though is about a (new) core tree-sitter
         | implementation in Emacs itself, which is not the same as the
         | third party package[1] you linked. To give credit where credit
         | is due though, it was obviously inspired by this work and what
         | it allowed in community-maintained packages.
         | 
         | The new implementation has been authored by Yuan Fu in close
         | collaboration with the core Emacs maintainers and the rest of
         | the community. It has been an ongoing effort for many, many
         | months.
         | 
         | This is great news, and means that also core Emacs language-
         | binding provided as part of Emacs itself will now be able to
         | make use of tree-sitter based parsers as well, something which
         | wouldn't have been happening if they would have to depend on a
         | third-party package to get those bindings.
         | 
         | I've been somewhat involved in the process, although not a
         | major player, but needless to say I'm very excited about this
         | news and can't wait to see what sort of improvements this
         | enables across the line once people start using it.
         | 
         | [1] https://github.com/emacs-tree-sitter/elisp-tree-sitter
        
           | phtrivier wrote:
           | So, we still have to wait for each major mode's maintainers
           | to update their code in order to benefit from those changes?
           | In this case, how big should the change be for a "typical"
           | mode? Is it going to happen for C/python/typescript/etc.
           | anytime soon?
        
             | josteink wrote:
             | If you follow the Emacs-devel mailing list, you will see
             | that many of the built-in modes are adding support for
             | tree-sitter to various degrees. Lots of languages are
             | already on the list (C, Python, Javascript, Bash, JSON and
             | CSS).
             | 
             | It also includes some new language modes which have never
             | been part of Emacs before (like typescript).
             | 
             | I'd love to see C# on the list, but that might depend on me
             | having the time to land a production-grade major mode, so
             | that might end up happening later rather than sooner.
             | 
             | Anyway, from what I understand what has been merged so far
             | should all be available as part of Emacs 29 once released.
        
       | erganemic wrote:
       | I'm really impressed with the strides Emacs has made recently:
       | native compilation, project.el, eglot, and now tree-sitter?
       | 
       | As a user who hadn't kept up with development news until
       | recently, I'd always mentally sorted Emacs into the same taxonomy
       | as stuff like `find`: old, powerful, with a clunky interface and
       | a stodgy resistance to updating how it does things (though not
       | without reason).
       | 
       | I'm increasingly feeling like that's an unfair classification on
       | my part--I'm genuinely super excited to see where Emacs is in 5
       | years.
        
         | zelphirkalt wrote:
         | I have the same feeling.
         | 
         | There is one more, possibly gigantic, thing though: Better
         | handling of very long strings. I know the data structures for
         | strings have various tradeoffs, but properly abstracted, it
         | should be possible to even give a choice, no? So users could
         | choose the data structure, based on their use cases. But I know
         | little about the internals and maybe that is all too low level
         | to be something a user could choose from the user interface or
         | configuration.
         | 
         | I hope the string data structure is properly abstracted, so
         | that it is exchangeable for another data structure, but I have
         | my doubts. I would like to be surprised here by anyone credibly
         | telling me that the string data structure in Emacs has an
         | abstraction barrier around it and is actually exchangeable, by
         | implementing basic string functions like "get nth character" or
         | "get substring" in terms of another data structure.
         | 
         | If it is not properly abstracted from, then of course it could
         | be a nightmare to change the data structure.
        
           | b3morales wrote:
           | This was also something that was enhanced recently and will
           | be in Emacs 29:
           | https://github.com/emacs-mirror/emacs/blob/21b387c39bd9cf07c...
           | 
           | > Emacs is now capable of editing files with very long lines.
           | 
           | > The display of long lines has been optimized, and Emacs should
           | > no longer choke when a buffer on display contains long lines.
           | 
           | > ...
        
         | ilyt wrote:
         | I use IntelliJ products but still prefer Emacs as an editor. I
         | moved off it for code because of IDE features: even when I
         | managed to get some of that convenience in Emacs, it ran
         | synchronously, which meant the experience could be pretty laggy
         | at times, vs "at worst a popup with extra info will be delayed"
         | in IDEA.
        
         | sph wrote:
         | Yes, it feels there is a lot of momentum going on recently.
         | 
         | Both neovim and Emacs are being improved at breakneck pace, and
         | it is quite incredible for such an old piece of software with,
         | dare I say, a quirky contribution model. The maintainers are
         | working really hard on keeping it current and competitive.
        
         | bloopernova wrote:
         | I'm really hoping that Emacs becomes multithreaded somehow. Or
         | at least improves some operations so that they're non-blocking.
         | 
         | I've been using Emacs primarily for org-mode/roam/babel for a
         | few years now. I'm very glad for its existence, I really think
         | I've become a more effective DevOps person because of it.
        
           | s0l1dsnak3123 wrote:
           | Indeed, I'm using Emacs for Code, reading/writing documents
           | and emails, as well as consuming RSS feeds. The ecosystem and
           | values that underpin Emacs are fantastic - in my personal
           | case the only downside to heavy use of Emacs is that it can
           | struggle to utilise my hardware. This tends to be
           | particularly noticeable when using TRAMP and Eglot, or
           | producing large org tables.
        
           | wyuenho wrote:
           | I'll be entirely satisfied with a process/event queue/loop
           | that we can submit tasks to like Javascript's. There is
           | already a command loop in Emacs, we just can't use it for
           | anything other than input events and commands. Once we have
           | a good event loop, we can build a state machine like Redux
           | on it, then we can start rebuilding the display machinery,
           | then we can start deleting all those hooks that constantly
           | interfere with each other...
        
           | ilyt wrote:
           | Yeah the extra micro-waits introduced by some IDE-like
           | features were annoying last time I used it.
        
             | deagle50 wrote:
             | Hope is on the horizon:
             | https://old.reddit.com/r/emacs/comments/ymrkyn/async_nonbloc...
        
               | rs_rs_rs_rs_rs wrote:
               | This is excellent!
        
           | tmalsburg2 wrote:
           | Emacs does have threads:
           | https://www.gnu.org/software/emacs/manual/html_node/elisp/Th...
        
             | bloopernova wrote:
             | I probably didn't use the right terminology. I mean that if
             | I list-packages then U, then x to start updating, I should
             | be able to go back to my editor and continue working.
        
               | natrys wrote:
               | There was a package[1] that did exactly that, so it
               | should be technically possible, unfortunately it has been
               | unmaintained for a while. In any case I/O asynchronicity
               | is achievable without actual multithreading (there are
               | also IRC/telegram/matrix/mastodon clients that don't
               | freeze the UI).
               | 
               | [1] https://github.com/Malabarba/paradox
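
The point about I/O asynchronicity without multithreading can be illustrated outside Emacs with Python's `asyncio`, which runs cooperative tasks on a single thread. This is an analogy only, not Emacs code: `update_packages` and `keep_editing` are hypothetical stand-ins, and the sleeps are placeholders for network waits and pauses between keystrokes.

```python
import asyncio
import threading


async def update_packages(log):
    """Stand-in for a long package update that mostly waits on I/O."""
    log.append(("update", "started", threading.get_ident()))
    await asyncio.sleep(0.2)  # placeholder for waiting on the network
    log.append(("update", "finished", threading.get_ident()))


async def keep_editing(log):
    """Stand-in for the user typing while the update is in flight."""
    for key in "abc":
        log.append(("edit", key, threading.get_ident()))
        await asyncio.sleep(0.01)  # yield to the event loop between keys


async def main():
    log = []
    # Both coroutines share one event loop and therefore one thread.
    await asyncio.gather(update_packages(log), keep_editing(log))
    return log


log = asyncio.run(main())
for source, what, thread_id in log:
    print(source, what, thread_id)
```

Every entry reports the same thread id, yet the "edit" events are handled while the update is still pending, because the update gives control back to the loop whenever it waits.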
        
               | tmalsburg2 wrote:
               | I think a lot of packages are not yet using threads. And
               | to be honest, I'm a bit scared of packages starting to
               | use threads because there are a million ways in which you
               | can mess up with threads especially given Emacs'
               | architecture. What if two threads start manipulating the
               | same buffer? Emacs wasn't built with these scenarios in
               | mind. But perhaps I'm too pessimistic and there are good
               | answers for that.
        
               | TeMPOraL wrote:
               | I want to see good interactive tools for working with and
               | introspecting threads / async / other concurrency models
               | first. In general, because I don't know of any, and in
               | Emacs in particular.
               | 
               | My current experience with Emacs concurrency is mostly
               | negative - occasionally, an async-heavy package (like
               | e.g. Magit-style UI for Docker) will break, and I find it
               | hard to figure out why. Futures-heavy code I've seen
               | tends to keep critical data local (lexically let-bound),
               | which is the opposite of what you want in a malleable
               | system like Emacs. For example, I'd like to have a way to
               | list _all_ unresolved futures everywhere in Emacs, the
               | way I can with e.g. external processes. But it seems that
               | at least the async library used (aio, IIRC) is not
               | designed for that.
        
               | klibertp wrote:
               | > For example, I'd like to have a way to list all
               | unresolved futures everywhere in Emacs, the way I can
               | with e.g. external processes.
               | 
               | I think you could get this done by advising promise
               | creation/resolution functions, aio-promise and aio-resolve.
               | The async/await macros are wrappers around
               | generators-over-promises in this library.
               | 
               | But yes, in general Emacs concurrency sucks. The least
               | bad option I found was using promises' implementation
               | (chuntaro/emacs-promise) that uses `cl-defgeneric` for
               | `promise-then` and (obviously) moving as much processing
               | to a subprocess as possible. The former allows you to
               | make any type "thenable" by implementing the method for
               | it, which is nice for bundling the state around async
               | operations. cl-defstructs are nice for the purpose.
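
The "advise creation and resolution to keep a global list" idea can be sketched in Python as an analogy. Nothing here is the aio or emacs-promise API: `Promise`, `make_promise`, `resolve`, `advise_after` and `PENDING` are all hypothetical names, and wrapping a function stands in for Emacs advice.

```python
class Promise:
    """Minimal stand-in for a library's promise object."""

    def __init__(self, label):
        self.label = label
        self.resolved = False
        self.value = None


def make_promise(label):
    """The library's creation function (hypothetical)."""
    return Promise(label)


def resolve(promise, value):
    """The library's resolution function (hypothetical)."""
    promise.resolved = True
    promise.value = value


PENDING = {}  # id(promise) -> promise, for every unresolved promise


def advise_after(fn, hook):
    """Return a wrapper that calls fn, then hook(result, *args)."""
    def wrapper(*args):
        result = fn(*args)
        hook(result, *args)
        return result
    return wrapper


def track_creation(result, label):
    PENDING[id(result)] = result


def track_resolution(result, promise, value):
    PENDING.pop(id(promise), None)


# "Advise" the two library functions by rebinding them to wrappers.
make_promise = advise_after(make_promise, track_creation)
resolve = advise_after(resolve, track_resolution)


def pending_labels():
    """List the labels of all promises that have not been resolved yet."""
    return sorted(p.label for p in PENDING.values())


fetch = make_promise("fetch")
build = make_promise("build")
resolve(fetch, 42)
print(pending_labels())
```

The registry only sees promises created through the wrapped functions, which is the same limitation advice would have if a library constructs its promises some other way.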
        
               | morelisp wrote:
               | I was sad the day I saw Emacs implemented threads before
               | a proper async event loop / futures / etc. Do those
               | first, see what kinds of concurrent code people actually
               | want to write, then write a multithreaded scheduler for
               | that.
               | 
               | Instead it's backwards, now we have hard-to-use
               | concurrency primitives and still shitty UIs.
        
             | wyuenho wrote:
             | Like the way Python has threads lol. Emacs has generators
             | too, and there are promises implemented on top of them, but
             | they aren't very useful in the elisp ecosystem because at
             | some point you are still going to have to poll due to a
             | lack of a JS like event loop that users can submit tasks
             | to.
        
         | bjourne wrote:
         | Check out the emacs-devel@gnu.org list sometime. It's
         | incredibly well run and is in my opinion the secret sauce that
         | keeps the project running.
        
       | jamborine wrote:
       | I'm so impressed by this
        
       | antipaul wrote:
       | What's the "explain it like I'm 5 years old" (ELI5) for tree-
       | sitter? Why should I, an emacs user but not lisp hacker, care
       | about it?
        
         | chriswarbo wrote:
         | tree-sitter creates parsers, e.g. for programming languages,
         | config formats, etc.
         | 
         | Emacs modes can use those parsers on buffer contents, e.g. for
         | syntax colouring/highlighting, finding matching delimiters
         | (e.g. moving the cursor over an `if`, and having all the
         | corresponding clauses (e.g. else/elif/fi) highlighted), for
         | contextual editing (e.g. escaping " when inside a string), etc.
         | 
         | This can be remarkably tricky to get right; e.g. consider
         | languages which can splice expressions inside strings (which
         | can themselves contain strings, containing spliced expressions,
         | etc.)
         | 
         | Using tree-sitter should make this easier and more robust (i.e.
         | less time spent implementing parsers; more time spent
         | implementing features!). I _think_ it would also allow grammars
         | to be re-used across different tools, which should improve
         | support for obscure/niche languages.
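
To see why strings with spliced expressions are tricky, here is a small Python sketch. It assumes a made-up syntax where `${...}` splices an expression into a double-quoted string; `scan_string` and `scan_splice` are hypothetical helpers, and nested braces or escape sequences inside a splice are deliberately ignored to keep it short.

```python
import re


def scan_string(text, i):
    """text[i] is an opening quote; return the index just past its close."""
    assert text[i] == '"'
    i += 1
    while i < len(text):
        if text[i] == '"':
            return i + 1
        if text.startswith("${", i):
            # A spliced expression: hand over to the expression scanner.
            i = scan_splice(text, i + 2)
        else:
            i += 1
    raise ValueError("unterminated string")


def scan_splice(text, i):
    """i is just past '${'; return the index just past the matching '}'."""
    while i < len(text):
        if text[i] == "}":
            return i + 1
        if text[i] == '"':
            # A string inside the splice: recurse back into string scanning.
            i = scan_string(text, i)
        else:
            i += 1
    raise ValueError("unterminated splice")


src = '"a ${f("b ${x} c")} d" + 1'

# A single flat regex stops at the first inner quote.
naive_end = re.match(r'"[^"]*"', src).end()
print(repr(src[:naive_end]))

# The mutually recursive scanners find the real end of the outer string.
real_end = scan_string(src, 0)
print(repr(src[:real_end]))
```

The two scanners call each other to any depth, which is the kind of nesting a flat regular expression cannot track and a generated parser handles as a matter of course.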
        
           | 2pEXgD0fZ5cF wrote:
           | Does this mean that every emacs language package would
           | automatically make use of this once it is built in? Or will
           | this rather enable the possibility to write/rewrite
           | programming language modes so they make use of tree-sitter
           | because they can assume it is available in the default emacs
           | install from then on?
        
             | omnicognate wrote:
             | It needs to be explicitly used. As far as I'm aware it
             | doesn't slot in behind an existing API and magically make
             | things better.
        
               | 2pEXgD0fZ5cF wrote:
               | Got it. Are there any beginner guides yet on how to write
               | an emacs (language) package while making use of it?
        
               | mdaniel wrote:
               | Unknown if this qualifies as "beginner guide" but the in-
               | tree document is titled "STARTER GUIDE ON WRITING MAJOR
               | MODE WITH TREE-SITTER":
               | https://git.savannah.gnu.org/cgit/emacs.git/tree/admin/notes...
        
         | giraffe_lady wrote:
         | You know how emacs typically has the worst syntax highlighting
         | of all mainstream editors for a given language? This makes it
         | better.
        
         | lawn wrote:
         | Another useful feature is that it makes it easier to support
         | mixing languages in the same file.
         | 
         | Think highlighting for html/JS/CSS in a single file or fully
         | featured highlighting inside markdown code snippets.
        
       | mcqueenjordan wrote:
       | I have a huge belief in tree-sitter. I think it's going to
       | continue to grow and become an important tool, especially in
       | security/code tooling contexts.
        
         | norir wrote:
         | The main innovation of tree-sitter, even more than incremental
         | parsing, as I see it is that it provides a uniform api for
         | traversing a parse tree, which makes it relatively
         | straightforward to onboard a new language to a tool with tree-
         | sitter support. The problem though is that the tree-sitter
         | grammar is nearly always going to be an approximation to the
         | actual language grammar, unless the language
         | compiler/interpreter uses tree-sitter for parsing. To me, this
         | is problematic for tooling because it is always possible for a
         | tree-sitter based tool to be flat out wrong relative to the
         | actual language. For syntax highlighting, this is generally not
         | a huge deal (and tree-sitter does generally work well, though
         | there are exceptions), but I'd be more cautious with security
         | tools based on tree-sitter.
         | 
         | If all languages changed their reference parsers to tree-
         | sitter, this would be moot, but that seems unlikely. Language
         | parsers are often optimized beyond what is possible in a
         | general purpose parser generator like tree-sitter and/or have
         | ambiguities that cannot be resolved with the tree-sitter dsl.
         | 
         | What feels perhaps likely in the future is that a standard
         | parse tree api emerges, analogous to lsp, and then language
         | parsers could emit trees traversable by this api. Maybe it's
         | just the tree-sitter c api with an alternate front end? Hard to
         | say, but I suspect either something better than (but likely at
         | least partially inspired by) tree-sitter will emerge or we will
         | get stuck in a local minimum with tooling based on slightly
         | incorrect language parsers.
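
What a "uniform API for traversing a parse tree" buys a tool can be sketched as follows. This is a hypothetical Python model, not the tree-sitter C API: `Node`, `walk` and `identifiers` are invented for illustration, and the node kinds in the two sample trees are illustrative rather than copied from any real grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # grammar-specific node type name
    text: str = ""            # source text for leaf nodes
    children: list = field(default_factory=list)


def walk(node):
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def identifiers(tree, identifier_kind="identifier"):
    """Collect identifier names; the same code serves any language whose
    tree is exposed through the Node interface."""
    return [n.text for n in walk(tree) if n.kind == identifier_kind]


# "total = 1" in a Python-like grammar.
python_tree = Node("module", children=[
    Node("assignment", children=[
        Node("identifier", "total"),
        Node("integer", "1"),
    ]),
])

# "int total = 1;" in a C-like grammar, with a differently shaped tree.
c_tree = Node("translation_unit", children=[
    Node("declaration", children=[
        Node("primitive_type", "int"),
        Node("init_declarator", children=[
            Node("identifier", "total"),
            Node("number_literal", "1"),
        ]),
    ]),
])

print(identifiers(python_tree))
print(identifiers(c_tree))
```

The tool code never mentions a language; only the node kind names differ per grammar. The caveat raised above still applies: the answers are only as accurate as the tree the grammar produced.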
        
           | debugnik wrote:
           | > unless the language compiler/interpreter uses tree-sitter
           | for parsing
           | 
           | Doubtful, last time I tried tree-sitter would parse invalid
           | inputs without even tagging any errors in the parse tree. For
           | example, it would silently accept extra tokens, or keywords
           | in the place of identifiers. Replacing the built-in lexer and
           | then validating the parse tree for correctness would be close
           | to writing the grammar twice.
           | 
           | And accepting partially correct inputs within the compiler
           | toolchain isn't too hard, so I don't really see the advantage
           | of agreeing on tree-sitter and not just on a parse tree
           | representation that editors can then query, as you then
           | suggested. If the big deal is having it execute client-side
           | or being sandboxed, I feel that's orthogonal to parsing
           | algorithms.
        
           | difflens wrote:
           | > as I see it is that it provides a uniform api for
           | traversing a parse tree, which makes it relatively
           | straightforward to onboard a new language to a tool with
           | tree-sitter support. The problem though is that the tree-
           | sitter grammar is nearly always going to be an approximation
           | to the actual language grammar, unless the language
           | compiler/interpreter uses tree-sitter for parsing.
           | 
           | Author of DiffLens
           | (https://marketplace.visualstudio.com/items?itemName=DiffLens...)
           | here. A uniform API for traversing a
           | parse tree for all languages would be amazing for DiffLens!
           | However, I fear languages are different enough that this
           | ideal may never be reached :) Or maybe there would be a core
           | set of APIs and extensions for the idiosyncrasies of each
           | language. For DiffLens though, we try to use the language's
           | official parser/compiler if it exposes an AST
        
           | cjohansson wrote:
           | tree-sitter is a bit better than regexp, but it is not an
           | actual parser of grammars. A fast, actual parser of all
           | languages for syntax coloring is the future, I think;
           | tree-sitter is a pragmatic middle ground while we wait for the
           | prime solution.
        
       ___________________________________________________________________
       (page generated 2022-11-23 23:01 UTC)