[HN Gopher] Diffsitter - A Tree-sitter based AST difftool to get...
___________________________________________________________________
Diffsitter - A Tree-sitter based AST difftool to get meaningful
semantic diffs
Author : mihau
Score : 75 points
Date : 2025-07-10 12:51 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fjfaase wrote:
| Discussed before on https://news.ycombinator.com/item?id=27875333
| koozz wrote:
| I thought I'd seen it before. I use Difftastic myself, amazing
| diffs. https://github.com/Wilfred/difftastic
| jbellis wrote:
| If you're looking for something more complete and actively
| maintained, check out https://github.com/GumTreeDiff/gumtree.
|
| (I evaluated semantic diff tools for use in Brokk but I
| ultimately went with standard textual diff; the main hangup that
| I couldn't get past is that semantic diff understandably works
| very poorly when you have a syntactically invalid file due to an
| in-progress edit.)
| pests wrote:
| I watched a video long ago about how the Roslyn C# compiler
| handled this but I forget the details.
| pfdietz wrote:
| The interesting problem here would be how do you produce a
| robust parse tree for invalid inputs, in the sense of stably
| parsing large sections of the text in ways that don't change
| too much. The tree would have to be an extension of an actual
| parse tree, with nodes indicating sections that couldn't be
| fully parsed or had errors. The diff algorithm would have to
| also be robust in the face of such error nodes.
|
| For the parsing problem, maybe something like Earley's algorithm
| that tries to minimize an error term?
|
| You need this kind of robust parser for languages with
| preprocessors.
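A minimal sketch of the "parse tree extended with error nodes" idea described above, assuming a toy language made only of bracket groups and plain text. The `Node` type, `parse_tolerant`, and `count_errors` are hypothetical names for illustration; this is not how diffsitter, GumTree, or Tree-sitter actually do it.

```python
from dataclasses import dataclass, field

PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


@dataclass
class Node:
    kind: str  # "root", "group", "text", or "error"
    text: str = ""
    children: list = field(default_factory=list)


def parse_tolerant(src: str) -> Node:
    """Parse bracket groups, recording errors as nodes instead of aborting."""
    root = Node("root")
    stack = [root]
    for ch in src:
        top = stack[-1]
        if ch in PAIRS:
            group = Node("group", ch)
            top.children.append(group)
            stack.append(group)
        elif ch in CLOSERS:
            if len(stack) > 1 and PAIRS[top.text] == ch:
                stack.pop()  # well-formed close
            else:
                # Stray or mismatched closer: keep it as an error leaf.
                top.children.append(Node("error", ch))
        elif top.children and top.children[-1].kind == "text":
            top.children[-1].text += ch
        else:
            top.children.append(Node("text", ch))
    # Groups still open at end of input are marked as errors.
    for unclosed in stack[1:]:
        unclosed.kind = "error"
    return root


def count_errors(node: Node) -> int:
    return (node.kind == "error") + sum(count_errors(c) for c in node.children)
```

In this sketch the subtrees before the damaged region are identical for `"f(a)"` and `"f(a))"`, which is the stability property a diff algorithm would rely on. A real implementation would also need the diff itself to treat error nodes sensibly.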
| o11c wrote:
| Unfortunately, this depends on making good decisions during
| language design; it's not something you can retrofit with a
| new lexer and parser.
|
| One _very_ important rule is: no token can span more than one
| (possibly backslash-extended) line. This means having neither
| delimited comments (use multiple single-line comments; if
| your editor is too dumb for this you really need a new
| editor) nor multi-line strings (but you can do implicit
| concatenation of a string literal flavor that implicitly
| includes the newline; as a side-effect this fixes the
| indentation problem).
|
| If you don't follow this rule, you might as well give up on
| robustness, because how else are you going to ever
| resynchronize after an error?
|
| For parsing you can generally just aggressively pop on
| mismatched parens, unexpected semicolons, or on keywords only
| allowed in a top-ish level context. Of course, if your
| language is insane (like C typedefs), you might not be able
| to parse the next top-level function/class anyway. GNU
| statement-expressions, by contrast, are an actually useful
| thing that requires some thought. But again, language design
| choices can mitigate this (such as making classes values,
| making template arguments equivalent to array indexing, and
| making statements expressions).
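A minimal sketch of why the one-line-token rule helps resynchronization, assuming a made-up token set (`#` comments, double-quoted strings, words, single-character punctuation) and ignoring backslash-extended lines. `lex_line` and `lex` are hypothetical names. Because no token can cross a newline, a lexing error is confined to the rest of its own line.

```python
import re

# Assumed toy token grammar; none of these tokens can contain a newline.
TOKEN = re.compile(r"""
    (?P<ws>\s+)
  | (?P<comment>\#.*)
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<word>\w+)
  | (?P<punct>[^\s\w"])
""", re.VERBOSE)


def lex_line(line: str) -> list:
    """Tokenize one line; an unlexable tail becomes a single error token."""
    tokens, pos = [], 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            # e.g. an unterminated string: give up on this line only.
            tokens.append(("error", line[pos:]))
            break
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def lex(src: str) -> list:
    # Each line is lexed independently, so the lexer resynchronizes
    # at every newline regardless of what happened on earlier lines.
    return [lex_line(line) for line in src.splitlines()]
```

With input such as `'a = "ok"\nb = "broken\nc = 3'`, only the second line yields an error token; the first and third lines lex exactly as they would if the second line were fixed.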
| pfdietz wrote:
| > how else are you going to ever resynchronize after an
| error?
|
| An error-cost-minimizing dynamic programming parser could
| do this.
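A minimal sketch of an error-cost-minimizing dynamic program, reduced to the simplest possible "grammar": balanced brackets, with deletion of a bracket as the only repair and a cost of 1 per deletion. The function name and cost model are assumptions for illustration; a real error-correcting parser over a full grammar is considerably more involved.

```python
from functools import lru_cache

PAIRS = {"(": ")", "[": "]", "{": "}"}
BRACKETS = set(PAIRS) | set(PAIRS.values())


def min_deletions_to_balance(s: str) -> int:
    """Fewest bracket deletions that leave s with balanced brackets."""

    @lru_cache(maxsize=None)
    def cost(i: int, j: int) -> int:
        # Cheapest repair of the half-open slice s[i:j].
        if i >= j:
            return 0
        if s[i] not in BRACKETS:
            return cost(i + 1, j)  # non-bracket text is free
        best = 1 + cost(i + 1, j)  # option 1: delete s[i]
        closer = PAIRS.get(s[i])
        if closer is not None:
            # Option 2: pair s[i] with some later matching closer s[k].
            for k in range(i + 1, j):
                if s[k] == closer:
                    best = min(best, cost(i + 1, k) + cost(k + 1, j))
        return best

    return cost(0, len(s))
```

Because the recurrence considers every possible pairing, the repair is chosen globally rather than by a fixed resynchronization point. The sketch is recursive and cubic in the worst case, so it is only suitable for short inputs.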
| ilyagr wrote:
| In case anybody happens to be interested in testing `gumtree`
| with https://github.com/jj-vcs/jj, I think I got them to work
| together. See
| https://github.com/GumTreeDiff/gumtree/wiki/VCS-Integration#...
| (assumes Docker).
| affyboi wrote:
| Note that diffsitter isn't abandoned or anything. I took a year
| off working and just started a new job so I've been busy. I've
| got a laundry list of stuff I want to do with this project that
| will get done (at some point)
| the__alchemist wrote:
| Is there an anti-tree-sitter version too?
| davepeck wrote:
| yes, although it's sort of the same as
| Context-Free-Typing-sitter
| esafak wrote:
| Someone make a semantic diff _splitter_, please! Break up big commits
| into small, atomic, meaningful ones.
| 0x457 wrote:
| Well, that's what git-patch is:
| https://patch-diff.githubusercontent.com/raw/denoland/deno/p...
| esafak wrote:
| I can't make sense of that link. How many parts was the diff
| split up into, and along what lines?
| 0x457 wrote:
| Yeah, I don't know why I linked that as an example. I wanted
| to show the structure of a patch. Each commit of a patch
| already has everything ready to be processed and chunked IF
| you keep the commits small, atomic, and semantically
| meaningful. In other words, make smaller commits.
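A minimal sketch of the mechanical first step a diff splitter would need: cutting a unified-diff patch into per-file hunks using the line counts in each `@@` header. `split_hunks` is a hypothetical helper, not part of git. It only separates hunks textually and makes no attempt at the semantic grouping asked for above.

```python
import re

# Matches headers such as "@@ -1,2 +1,2 @@" or "@@ -10 +10,2 @@".
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def split_hunks(patch: str) -> list:
    """Return a list of (new_path, hunk_lines) pairs from a unified diff."""
    hunks = []
    path = None
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch.splitlines():
        if old_left > 0 or new_left > 0:
            hunks[-1][1].append(line)
            if line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # context line counts on both sides
                old_left -= 1
                new_left -= 1
            continue
        m = HUNK.match(line)
        if m:
            # An omitted count in the header means 1.
            old_left = int(m.group(1) or 1)
            new_left = int(m.group(2) or 1)
            hunks.append((path, [line]))
        elif line.startswith("+++ "):
            path = line[4:]
    return hunks
```

Counting lines from the header, rather than scanning for the next `@@`, keeps commit messages and mail signatures in a multi-commit patch from leaking into the last hunk. One known gap in this sketch: a `\ No newline at end of file` marker that follows the final counted line of a hunk is dropped.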
| ethan_smith wrote:
| Check out git-imerge or git-absorb which can help with this
| problem by intelligently splitting or absorbing changes into
| the right commits.
| pmkary wrote:
| What a genius idea.
| affyboi wrote:
| Nah I think most people could make something like this in a
| weekend
| vrm wrote:
| This is neat! I think in general there are really deep
| connections between semantically meaningful diffs (across
| modalities) and supervision of AI models. You might imagine a
| human-in-the-loop workflow where the human makes edits to a
| particular generation and then those edits are used as
| supervision for a future implementation of that thing. We did
| some related work here:
| https://www.tensorzero.com/blog/automatically-evaluating-ai-...
| on the coding use case but I'm interested in all the different
| approaches to the problem and especially on less structured
| domains.
| dcre wrote:
| See also https://mergiraf.org/ for a tool that uses ASTs to
| resolve (some) merge conflicts.
| Iwan-Zotow wrote:
| Is there an integration for VS Code?
| 1-more wrote:
| See also difftastic
| https://difftastic.wilfred.me.uk/languages_supported.html
| ilyagr wrote:
| https://github.com/Wilfred/difftastic/wiki/Structural-Diffs is a
| nice list of alternatives.
|
| Difftastic itself is great as well! The author wrote up nice
| posts about its design:
| https://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fan...,
| https://difftastic.wilfred.me.uk/diffing.html.
___________________________________________________________________
(page generated 2025-07-10 23:00 UTC)