[HN Gopher] Diffsitter - A Tree-sitter based AST difftool to get...
       ___________________________________________________________________
        
       Diffsitter - A Tree-sitter based AST difftool to get meaningful
       semantic diffs
        
       Author : mihau
       Score  : 75 points
       Date   : 2025-07-10 12:51 UTC (10 hours ago)
        
        
       | fjfaase wrote:
       | Discussed before on https://news.ycombinator.com/item?id=27875333
        
         | koozz wrote:
         | I thought I'd seen it before. I use Difftastic myself; it
         | produces amazing diffs. https://github.com/Wilfred/difftastic
        
       | jbellis wrote:
       | If you're looking for something more complete and actively
       | maintained, check out https://github.com/GumTreeDiff/gumtree.
       | 
       | (I evaluated semantic diff tools for use in Brokk but I
       | ultimately went with standard textual diff; the main hangup that
       | I couldn't get past is that semantic diff understandably works
       | very poorly when you have a syntactically invalid file due to an
       | in-progress edit.)
        
         | pests wrote:
         | I watched a video long ago about how the Roslyn C# compiler
         | handled this but I forget the details.
        
         | pfdietz wrote:
         | The interesting problem here would be how do you produce a
         | robust parse tree for invalid inputs, in the sense of stably
         | parsing large sections of the text in ways that don't change
         | too much. The tree would have to be an extension of an actual
         | parse tree, with nodes indicating sections that couldn't be
         | fully parsed or had errors. The diff algorithm would have to
         | also be robust in the face of such error nodes.
         | 
         | For the parsing problem, maybe something like Earley's algorithm
         | that tries to minimize an error term?
         | 
         | You need this kind of robust parser for languages with
         | preprocessors.
        
           | o11c wrote:
           | Unfortunately, this depends on making good decisions during
           | language design; it's not something you can retrofit with a
           | new lexer and parser.
           | 
           | One _very_ important rule is: no token can span more than one
           | (possibly backslash-extended) line. This means having neither
           | delimited comments (use multiple single-line comments; if
           | your editor is too dumb for this you really need a new
           | editor) nor multi-line strings (but you can do implicit
           | concatenation of a string literal flavor that implicitly
           | includes the newline; as a side-effect this fixes the
           | indentation problem).
           | 
           | If you don't follow this rule, you might as well give up on
           | robustness, because how else are you going to ever
           | resynchronize after an error?
           | 
           | For parsing you can generally just aggressively pop on
           | mismatched parens, unexpected semicolons, or on keywords only
           | allowed in a top-ish level context. Of course, if your
           | language is insane (like C typedefs), you might not be able
           | to parse the next top-level function/class anyway. GNU
           | statement-expressions, by contrast, are an actually useful
           | thing that requires some thought. But again, language design
           | choices can mitigate this (such as making classes values,
           | template argument equivalent to array indexing, and
           | statements expressions).
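
The "aggressively pop on mismatched parens" idea above can be sketched
roughly as follows. This is a minimal illustration, not code from any of
the tools discussed in the thread; the function name match_with_recovery
and the error messages are hypothetical. It assumes brackets are the only
structure that matters and ignores strings and comments entirely.

```python
# Hypothetical sketch: bracket matching that recovers from mismatches
# by popping unclosed openers instead of giving up at the first error.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def match_with_recovery(text):
    """Return a list of (position, message) errors found in text."""
    stack = []   # entries are (opener character, position)
    errors = []
    for i, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((ch, i))
        elif ch in PAIRS:
            want = PAIRS[ch]
            # Look for the nearest enclosing opener of the right kind.
            depth = None
            for d in range(len(stack) - 1, -1, -1):
                if stack[d][0] == want:
                    depth = d
                    break
            if depth is None:
                # No opener to resynchronize on: report and skip the closer.
                errors.append((i, f"unexpected {ch!r}"))
                continue
            # Pop every opener above the match and report it as unclosed.
            while len(stack) - 1 > depth:
                opener, pos = stack.pop()
                errors.append((pos, f"unclosed {opener!r}"))
            stack.pop()  # the matching opener itself
    # Anything left on the stack at end of input was never closed.
    for opener, pos in stack:
        errors.append((pos, f"unclosed {opener!r}"))
    return errors


print(match_with_recovery("f(a[b)"))  # [(3, "unclosed '['")]
```

The point of the sketch is that one stray bracket produces one local
error, and everything after the resynchronization point is still matched
normally.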
        
             | pfdietz wrote:
             | > how else are you going to ever resynchronize after an
             | error?
             | 
             | An error-cost-minimizing dynamic programming parser could
             | do this.
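
As a toy illustration of error-cost-minimizing dynamic programming, the
sketch below computes the fewest bracket tokens that must be treated as
errors (deleted) for the rest to form a balanced sequence. It is only an
instance of the general idea on a tiny bracket grammar, not a full parser
and not the algorithm of any tool mentioned here; the name min_deletions
is hypothetical, and every error is assumed to cost 1.

```python
# Hypothetical sketch: interval DP that minimizes the number of bracket
# tokens skipped as errors so that the remainder is balanced.
from functools import lru_cache

PAIRS = {"(": ")", "[": "]", "{": "}"}
BRACKETS = set(PAIRS) | set(PAIRS.values())


def min_deletions(text):
    # Keep only bracket tokens; all other characters are ignored.
    tokens = [c for c in text if c in BRACKETS]

    @lru_cache(maxsize=None)
    def cost(i, j):
        """Minimum deletions that make tokens[i:j] balanced."""
        if i >= j:
            return 0
        # Option 1: treat tokens[i] as an error and skip it.
        best = 1 + cost(i + 1, j)
        # Option 2: pair tokens[i] with some later matching closer.
        closer = PAIRS.get(tokens[i])
        if closer is not None:
            for k in range(i + 1, j):
                if tokens[k] == closer:
                    best = min(best, cost(i + 1, k) + cost(k + 1, j))
        return best

    return cost(0, len(tokens))


print(min_deletions("f(a[b)"))  # 1
print(min_deletions("([)]"))    # 2
```

This runs in cubic time in the number of bracket tokens, which is why
practical error-recovering parsers usually bound the search rather than
minimizing globally.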
        
         | ilyagr wrote:
         | In case anybody happens to be interested in testing `gumtree`
         | with https://github.com/jj-vcs/jj, I think I got them to work
         | together. See https://github.com/GumTreeDiff/gumtree/wiki/VCS-
         | Integration#... (assumes Docker).
        
         | affyboi wrote:
         | Note that diffsitter isn't abandoned or anything. I took a year
         | off working and just started a new job so I've been busy. I've
         | got a laundry list of stuff I want to do with this project that
         | will get done (at some point)
        
       | the__alchemist wrote:
       | Is there an anti-tree-sitter version too?
        
         | davepeck wrote:
         | yes, although it's sort of the same as Context-Free-Typing-
         | sitter
        
       | esafak wrote:
       | Someone make a semantic diff _splitter_ please! Break up big
       | commits into small, atomic, meaningful ones.
        
         | 0x457 wrote:
         | Well, that's what git-patch is: https://patch-
         | diff.githubusercontent.com/raw/denoland/deno/p...
        
           | esafak wrote:
           | I can't make sense of that link. How many parts was the diff
           | split up into, and along what lines?
        
             | 0x457 wrote:
             | Yeah, I don't know why I linked that as an example. I wanted
             | to show the structure of a patch. Each commit in a patch
             | already has everything ready to be processed and chunked, if
             | you keep commits small, atomic, and semantically meaningful.
             | As in, make smaller commits.
        
         | ethan_smith wrote:
         | Check out git-imerge or git-absorb which can help with this
         | problem by intelligently splitting or absorbing changes into
         | the right commits.
        
       | pmkary wrote:
       | What a genius idea.
        
         | affyboi wrote:
         | Nah I think most people could make something like this in a
         | weekend
        
       | vrm wrote:
       | This is neat! I think in general there are really deep
       | connections between semantically meaningful diffs (across
       | modalities) and supervision of AI models. You might imagine a
       | human-in-the-loop workflow where the human makes edits to a
       | particular generation and then those edits are used as
       | supervision for a future implementation of that thing. We did
       | some related work here:
       | https://www.tensorzero.com/blog/automatically-evaluating-ai-...
       | on the coding use case but I'm interested in all the different
       | approaches to the problem and especially on less structured
       | domains.
        
       | dcre wrote:
       | See also https://mergiraf.org/ for a tool that uses ASTs to
       | resolve (some) merge conflicts.
        
       | Iwan-Zotow wrote:
       | Is there an integration with VS Code?
        
       | 1-more wrote:
       | See also difftastic
       | https://difftastic.wilfred.me.uk/languages_supported.html
        
       | ilyagr wrote:
       | https://github.com/Wilfred/difftastic/wiki/Structural-Diffs is a
       | nice list of alternatives.
       | 
       | Difftastic itself is great as well! The author wrote up nice
       | posts about its design:
       | https://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fan...,
       | https://difftastic.wilfred.me.uk/diffing.html.
        
       ___________________________________________________________________
       (page generated 2025-07-10 23:00 UTC)