[HN Gopher] Difftastic: Syntax-aware structured diff tool
       ___________________________________________________________________
        
       Difftastic: Syntax-aware structured diff tool
        
       Author : ingve
       Score  : 275 points
       Date   : 2021-07-08 06:03 UTC (16 hours ago)
        
        
       | SeriousM wrote:
       | That's awesome!
       | 
       | Much better than "guessed" syntax diff or even line diff.
        
       | pedrovhb wrote:
       | This is cool. I work with a large Python codebase that's been
       | around for a while (hasn't been that long since it's been
       | completely off Python 2). It's not a bad codebase, but naturally
       | it has almost no typing annotations, and other devs are just
       | starting to come around to the idea.
       | 
       | I love `mypy --strict` but it produces way too much noise with
       | all the other code, so I've been using a tool I made that runs
       | mypy over the changed files and then greps the output to filter
       | for lines that `git diff` points out have changed. It's pretty
       | rough and imperfect (doesn't catch errors that started appearing
       | in unchanged files, or unchanged lines in the same file, for
       | instance) but it's still quite helpful. I've been meaning to make
       | an improved version that runs mypy over the branch point, then
       | over my branch, and then maps new and changed lines between them
       | so it displays all errors that are new, but I haven't gotten
       | around to it yet. It'd be useful for other tools too, like
       | semgrep.
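
The filtering idea described above can be sketched roughly as follows. This is not the commenter's tool: the function names are hypothetical, and the sketch assumes mypy's default `path:line: error: ...` output and a plain unified diff such as the one `git diff` prints.

```python
import re
from collections import defaultdict

# Hunk header of a unified diff, e.g. "@@ -10,3 +12,4 @@"
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")
# Assumed mypy line format: "path.py:12: error: ..." (a column may follow the line)
MYPY_RE = re.compile(r"^(?P<path>[^:]+):(?P<line>\d+):")


def changed_lines(diff_text):
    """Map each file path to the new-file line numbers that a unified diff adds."""
    changed = defaultdict(set)
    path = None
    new_lineno = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/path" names the new version of the file
            target = line[4:]
            path = target[2:] if target.startswith("b/") else target
        elif (m := HUNK_RE.match(line)):
            new_lineno = int(m.group(1))
        elif path is None:
            continue
        elif line.startswith("+"):
            changed[path].add(new_lineno)
            new_lineno += 1
        elif line.startswith(" "):
            new_lineno += 1  # context line: present in both versions
        # lines starting with "-" exist only in the old file, so they are skipped
    return changed


def filter_mypy(mypy_output, changed):
    """Keep only the mypy messages that point at an added or modified line."""
    kept = []
    for line in mypy_output.splitlines():
        m = MYPY_RE.match(line)
        if m and int(m.group("line")) in changed.get(m.group("path"), ()):
            kept.append(line)
    return kept
```

As the comment notes, this kind of filter misses errors that newly appear on unchanged lines; catching those would need the two-run comparison described above.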
        
         | NateEag wrote:
         | I'd forgotten about this, but years ago I was working in a huge
         | PHP nightmare codebase.
         | 
         | So, I hacked together a pre-commit hook that blocked the commit
         | only if the configured style checker registered errors on lines
         | being added by the diff.
         | 
         | It never got very polished, but I wound up using it in two
         | codebases over the years.
         | 
         | https://github.com/NateEag/diff-check
        
         | zomglings wrote:
         | This sounds well in line with what I have been building - a
         | tool to take syntax-aware diffs across git commits:
         | https://github.com/bugout-dev/locust
         | 
         | It currently supports Python, Javascript, and Java. I like the
         | idea of mypy changes, as well.
        
           | pedrovhb wrote:
           | I'll have a look, thanks!
        
       | philipov wrote:
       | This seems like a really difficult issue to solve in the general
       | case, but I found that solving it for the specific case I had was
       | a tractable problem.
       | 
       | I have a need to diff the output of a query and compare it to the
       | last time it ran, to do regression testing. Just diffing the
       | resulting CSVs wasn't very useful, because I needed the ability
       | to do things like ignore new columns, and report the exact column
       | that had differences from the previous version.
       | 
       | I was able to do that by defining a primary key on which I could
       | outer join the two tables. Missing or new rows would be the ones
       | that didn't join, and then I could do a per-column comparison for
       | each row that did join.
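
A minimal sketch of the keyed comparison described above, assuming each query result has been loaded as a list of dicts (for example with `csv.DictReader`). The function name `diff_tables` and its parameters are illustrative, not taken from the commenter's code.

```python
def diff_tables(old_rows, new_rows, key, ignore_new_columns=True):
    """Outer-join two tables on `key` and report row-level and column-level changes."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    # Rows that fail to join are the removed or added ones.
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())

    # Rows that do join get a per-column comparison.
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        if ignore_new_columns:
            columns = old[k].keys()
        else:
            columns = old[k].keys() | new[k].keys()
        diffs = {
            c: (old[k].get(c), new[k].get(c))
            for c in columns
            if old[k].get(c) != new[k].get(c)
        }
        if diffs:
            changed[k] = diffs
    return added, removed, changed
```

With `ignore_new_columns=True`, a column that exists only in the new result is not reported, which matches the "ignore new columns" requirement mentioned above.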
        
       | Wilfred wrote:
       | Author here! Happy to answer any questions :)
        
       | Audiophilip wrote:
       | My favorite diff tool is the one shipped with Plastic SCM, Xdiff.
       | Since it's visual, it makes it very easy to see what changes have
       | been done to the file.
       | 
       | https://www.plasticscm.com/features/xmerge
        
         | omgtehlion wrote:
         | And their Git UI, gmaster, includes the same diff tech too. I
         | do not use it for everyday tasks, but when I need to understand
         | complex and/or big changes this is my go-to tool.
        
       | dan-robertson wrote:
       | I personally am much more excited by "sliders" than the
       | structure-aware diffs. Marking additions between [], it is the
       | difference between e.g.
       | 
       |     handle_case [some new case over multiple lines
       |     handle_case] some existing case
       | 
       |     [  check_invariant();
       |     }
       | 
       |     function newFunc(){
       |       ...
       |     ]  check_invariant();
       |     }
       | 
       | And
       | 
       |     [handle_case some
       |       new case]
       |     handle_case old case
       | 
       |     [function newFunc(){
       |       ...
       |       check_invariant();
       |     }]
        
         | Wilfred wrote:
         | I agree sliders are a problem, and I hope to have a solution
         | there.
         | 
         | Syntactic differs already do better because they understand
         | that parentheses/brackets are paired. Difftastic does OK with
         | this example: https://imgur.com/a/pVlVBo5
        
           | dan-robertson wrote:
           | Yeah I'm keen to see your solution.
           | 
           | FWIW, the formatting of the snippets I wrote above was as two
           | separate diffs for additions with the new additions (ie green
           | parts) represented with [square brackets].
        
       | mookid11 wrote:
       | I wrote diffr [0] for that purpose; it serves me well, especially
       | if your team makes code with long lines.
       | 
       | In my opinion, a simple approach that does NOT do any parsing
       | is more efficient (what about bugs in your parser? code with
       | syntax errors? also, how fast would the parser be?)
       | 
       | [0]: https://github.com/mookid/diffr
        
         | feanaro wrote:
         | Many of your concerns could be alleviated by using Tree-Sitter.
         | (https://tree-sitter.github.io/tree-sitter/)
        
           | pfdietz wrote:
           | Tree-sitter is great, but I find it could do a better job
           | with broken code. This is particularly important when parsing
           | things like C or C++ where the preprocessor makes it likely
           | that unpreprocessed code can't be parsed anyway.
        
       | tlamponi wrote:
       | Surely far from being as elaborate as the linked tool, but I use
       | the following git command a few dozen times daily:
       | 
       |     git diff --word-diff=color --word-diff-regex='\w+'
       | 
       | I added two aliases to my .gitconfig, one for diff and one for
       | show:
       | 
       |     [alias]
       |         word-show = show --word-diff=color --word-diff-regex='\\w+'
       |         word-diff = diff --word-diff=color --word-diff-regex='\\w+'
       | 
       | Those small things improved development and reviewing a lot for
       | me!
       | 
       | If stuff moved around or got its indentation changed, I add the
       | `--color-moved` and/or `-w` (ignore whitespace changes) flags
       | to filter out extra noise.
       | 
       | Sometimes I need to use another regex though, e.g. a simple dot
       | `.` to match every character individually, with no greedy `+`.
        
       | mikepurvis wrote:
       | I'd feel so much more motivation for checking out alternative
       | diff tools if there was a better story for integrating them with
       | the review tools in Github, GitLab, etc. I know there's nothing
       | anyone can do about that-- it's something the Git hosts
       | themselves have to enable, or I have to see enough benefit in it
       | to go to a dedicated review tool to make the bother of that
       | worthwhile.
       | 
       | I believe Gerrit has a pluggable diff-- is there anything broader
       | in the works to improve this story?
        
         | Wilfred wrote:
         | Definitely!
         | 
         | I still look at diffs in the terminal pretty often, but all my
         | code reviews are in rendered HTML.
         | 
         | That said, there needs to be a credible tool before review
         | tools can adopt it! GitHub does a line-based diff with word-
         | based highlighting, which is probably the best you can do
         | without syntactic smarts.
        
           | mikepurvis wrote:
           | It would be neat if there was a way to supply a "diff hint"
           | or something right in your git commit metadata. Obviously the
           | receiver/reviewer/renderer can ultimately do whatever they
           | want, but it would be helpful if I as the one preparing the
           | change could at least specify intent.
           | 
           | I guess projects like the kernel where the review system is
           | built around emailed patches kind of already get this for
           | free-- once committed, the change will be rendered according
           | to the local user's git settings, but during the review
           | itself, it will be a diff prepared by the change's author
           | that will be under discussion.
           | 
           | In a glorious future where GitLab has four different diff
           | options, it would be great if I could specify that I want it
           | to default to the hinted diff tool, falling back to my
           | preferred one if there is no hint.
        
       | bifftastic wrote:
       | I like the name
        
       | davidkunz wrote:
       | To ease the pain in conventional differs, we use a pre-commit
       | hook to format the source code (prettier). This way we only see
       | differences if something _actually_ changed.
        
         | conceptme wrote:
         | This worsens the problem especially in templates when the
         | nesting changes.
        
           | frafra wrote:
           | Good diff tools will only show you that the indentation
           | changed, not the line as a whole (Meld for example).
        
         | danuker wrote:
         | Prettier works for JS.
         | 
         | A similar Python tool is Black: https://github.com/psf/black
         | 
         | > Black makes code review faster by producing the smallest
         | diffs possible.
        
         | Wilfred wrote:
         | A syntactic differ like Difftastic is very helpful when your
         | codebase is autoformatted. Formatters often reflow code.
         | 
         | Given the code:
         | 
         |     foo(one, two, three);
         | 
         | If you add an argument and reformat:
         | 
         |     foo(
         |       one,
         |       two,
         |       new,
         |       three
         |     );
         | 
         | A line-based diff can make it hard to spot what's changed.
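
To make the contrast concrete, here is a small sketch using Python's standard `difflib`. It is not how Difftastic works internally: the regex tokenizer is a crude stand-in for a real parser, and the helper names are made up for this example.

```python
import difflib
import re

OLD = "foo(one, two, three);"
NEW = "foo(\n  one,\n  two,\n  new,\n  three\n);"


def added_lines(before, after):
    """Lines a line-based unified diff reports as added."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]


def tokens(src):
    """Split source into words and single punctuation characters, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", src)


def added_tokens(before, after):
    """Tokens that appear only on the new side when comparing token sequences."""
    a, b = tokens(before), tokens(after)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added


print(added_lines(OLD, NEW))   # every line of the reformatted call
print(added_tokens(OLD, NEW))  # only the tokens around the new argument
```

The line-based view reports the whole reformatted call as new, while the token-based view ignores the reflow and isolates the new argument.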
        
           | maw wrote:
           | When I was still formatting code manually (... -ish; emacs
           | did a lot of the tedious work for me) I eventually settled on
           | a style very similar to your four-argument example, precisely
           | because it makes diffs easier to read.
           | 
           | For the same reason, I asciibetically sorted things when it
           | made sense.
           | 
           | Now I use prettier and black and I'm mostly satisfied by
           | them, but their reflow behavior puts the lie to "[black]
           | makes code review faster by producing the smallest diffs
           | possible."
        
         | Cthulhu_ wrote:
         | I think code formatting should be mandatory and one of the
         | first things you adopt in your project. Resist code style rule
         | changes as much as possible, and if you do, apply them across
         | the whole codebase in one go to avoid churn and noise in diffs
         | down the line.
         | 
         | And if you do make style changes, put them in a separate commit
         | at the very least so the diffs are cleaner and code reviews are
         | easier.
         | 
         | In my project I use gofmt (goimports) for back-end code and
         | prettier for front-end; I've configured my editor to apply
         | those on save, and a pre-commit hook to either run the
         | formatter, or error if the formatting is not according to the
         | spec.
         | 
         | One of Go's proverbs is "Gofmt's style is no one's favorite,
         | yet gofmt is everyone's favorite.". Consistency and low noise
         | is more important (in that case) than a specific code style
         | preference.
        
           | zamalek wrote:
           | > gofmt
           | 
           | gofmt is in a small set of formatters that disallows
           | configuration and choice; it's also the first, to my
           | recollection. This is a feature because deciding on a coding
           | standard is bike shedding. It's also extremely aggressive,
           | and undoes pretty much any choice you may make in formatting
           | your code; a feature, again.
           | 
           | Not all languages are this fortunate. Some have configuration
           | (cargo fmt), others aren't aggressive enough (Roslyn).
        
           | llimllib wrote:
           | you might want to check out gofumpt too, if you haven't
           | already: https://github.com/mvdan/gofumpt
        
           | TotempaaltJ wrote:
           | And if you do decide to do single-commit massive style
           | changes, add the commits to an ignore revs file:
           | http://git-scm.com/docs/git-config#Documentation/git-config....
        
             | lpapez wrote:
             | Did not know that thing existed, super useful. Thanks a
             | ton.
        
       | vcmiraldo wrote:
       | That is a really difficult problem for more reasons than what
       | fits in this comment :) In fact, I got my PhD studying this very
       | problem (https://victorcmiraldo.github.io/data/MiraldoPhD.pdf).
       | 
       | I did not find any description of how your diffing algorithm
       | works nor how you represent a patch. I'd be really curious to
       | know more.
        
         | Wilfred wrote:
         | Wow, thank you for the pointer! I've added it to
         | https://github.com/Wilfred/difftastic/wiki/Structural-Diffs as
         | I'm trying to understand the other solutions in this space.
         | 
         | Difftastic does not create a patch or worry about merging.
         | That's a hard problem that I'm not trying to solve. Instead, it
         | builds two ASTs, then marks each node as unchanged or novel.
         | 
         | To compute the diff, I use a graph search. Each vertex
         | represents a position in both the left and right ASTs.
         | 
         | Suppose you're comparing A with X A.
         | 
         | Start node:
         | 
         |     Left: A   Right: X A
         |           ^          ^
         | 
         | The possible next nodes are:
         | 
         | (1) Treat A on the left as novel.
         | 
         |     Left: A   Right: X A
         |            ^         ^
         | 
         | (2) Treat X on the right as novel.
         | 
         |     Left: A   Right: X A
         |           ^            ^
         | 
         | Both (1) and (2) are the same 'distance', but (2) is closer to
         | the end node, because there's an edge from (2) to the end that
         | marks A as unchanged.
         | 
         | I've implemented this using Dijkstra's algorithm. My graph is
         | directed and acyclic, so there are faster algorithms like
         | topological sort. However, I don't construct the whole graph in
         | advance (that would take O(N^2) memory) so instead I construct
         | the graph nodes as necessary.
         | 
         | (This is very similar to Autochrome, which I've linked in the
         | README. Autochrome has a worked example which is really
         | helpful.)
         | 
         | At some point I think I'll have to use A* search instead. If
         | there are more than 500 lines of code with lots of changes,
         | difftastic can take a few seconds to terminate due to the naive
         | graph search.
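
The search described above can be illustrated with a much simplified sketch. It works on flat token lists rather than ASTs, and the edge costs (0 for unchanged, 1 for novel) are assumptions made for this example, not Difftastic's actual weights. Neighbours are generated on demand, so the full graph is never built.

```python
import heapq


def diff_tokens(left, right):
    """Dijkstra over (left position, right position) pairs, built lazily."""
    start, goal = (0, 0), (len(left), len(right))
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if cost > dist[(i, j)]:
            continue  # stale heap entry; a cheaper route was already found
        # Generate the outgoing edges of this vertex only now.
        steps = []
        if i < len(left) and j < len(right) and left[i] == right[j]:
            steps.append(((i + 1, j + 1), 0, ("unchanged", left[i])))
        if i < len(left):
            steps.append(((i + 1, j), 1, ("novel-left", left[i])))
        if j < len(right):
            steps.append(((i, j + 1), 1, ("novel-right", right[j])))
        for node, weight, label in steps:
            new_cost = cost + weight
            if new_cost < dist.get(node, float("inf")):
                dist[node] = new_cost
                prev[node] = ((i, j), label)
                heapq.heappush(heap, (new_cost, node))
    # Walk back from the goal to recover the labels along the cheapest path.
    labels = []
    node = goal
    while node != start:
        node, label = prev[node]
        labels.append(label)
    labels.reverse()
    return labels


print(diff_tokens(["A"], ["X", "A"]))
```

For the `A` versus `X A` example, the cheapest path marks `X` as novel on the right and `A` as unchanged, which corresponds to option (2) above.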
        
         | dan-robertson wrote:
         | I think the diffing is the "obvious" graph search algorithm
         | between trees, where a "tree" is a list of atoms or trees
         | (think lisp lists).
         | 
         | Basically to diff a tree of n top-level elements against one of
         | m elements, construct a graph where nodes lie on an (n+1)x(m+1)
         | grid. Each node (a,b) corresponds to having looked at a
         | elements of the first and matched them to b elements of the
         | second list. Add edges (a,b)->(a+1,b) for deletion;
         | (a,b)->(a,b+1) for insertion; and (a,b)->(a+1,b+1) for an inner
         | diff (ie basically this graph search problem again). Choose
         | weights to apply to each node, and now find the shortest path from
         | (0,0) to (n,m).
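
A rough sketch of that grid formulation, with trees modelled as nested Python lists of string atoms. Two things here are assumptions rather than anything stated in the thread: costs are attached to edges and equal the size of the subtree being deleted or inserted, and the shortest path is found by filling the grid in row order (possible because the grid graph is acyclic) instead of running a general graph search.

```python
def size(tree):
    """Number of nodes in a tree: 1 for an atom, 1 plus the children for a list."""
    if not isinstance(tree, list):
        return 1
    return 1 + sum(size(child) for child in tree)


def tree_diff_cost(a, b):
    """Cheapest cost of turning tree `a` into tree `b`."""
    if not isinstance(a, list) or not isinstance(b, list):
        # Atoms (or an atom against a list): equal costs nothing,
        # otherwise delete one side and insert the other.
        return 0 if a == b else size(a) + size(b)
    n, m = len(a), len(b)
    # best[i][j] is the shortest path from (0,0) to grid node (i,j).
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:  # edge (i-1,j) -> (i,j): delete a[i-1]
                options.append(best[i - 1][j] + size(a[i - 1]))
            if j > 0:  # edge (i,j-1) -> (i,j): insert b[j-1]
                options.append(best[i][j - 1] + size(b[j - 1]))
            if i > 0 and j > 0:  # edge (i-1,j-1) -> (i,j): inner diff
                options.append(best[i - 1][j - 1] + tree_diff_cost(a[i - 1], b[j - 1]))
            best[i][j] = min(options)
    return best[n][m]


before = ["foo", ["one", "two", "three"]]
after = ["foo", ["one", "two", "new", "three"]]
print(tree_diff_cost(before, after))
```

The diagonal edge recurses into the two subtrees, which is the "this graph search problem again" step mentioned above.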
        
           | vcmiraldo wrote:
           | From your description it seems like we're just computing the
           | standard insert-delete tree-edit-distance. These tend to be
           | slow.
           | 
           | This implies that the patch language only supports insertion,
           | deletion and modification of nodes, which is a shame since
           | refactorings, moves and duplications are also common
           | operations in the source-code domain. Additionally, if the
           | patch language only supports insertion, deletion and
           | modification, the merging algorithm will perform poorly.
        
             | Wilfred wrote:
             | Yep, that's a fair description. Note that I'm not providing
             | a merge algorithm, just a pretty way of viewing changes.
             | 
             | I did look at modelling moves in an earlier prototype, but
             | it's incredibly hard to display the result in a coherent
             | way when there are also insertions. It was also easier to
             | drop it when I moved to Dijkstra.
             | 
             | As you can see in the screenshot in the readme, it does
             | support inserting tree nodes whilst preserving children,
             | which covers a ton of cases.
        
       | zokier wrote:
       | Gumtree diff is based on ast:
       | https://github.com/GumTreeDiff/gumtree
       | 
       | There is some discussion of using Treesitter for parsing, which
       | would potentially open the door to many languages:
       | https://github.com/GumTreeDiff/gumtree/issues/148
        
       | austincheney wrote:
       | I worked on this problem for just over a decade before moving
       | onto other things. It's a tough problem to solve.
       | 
       | The biggest problem I ran into is that the largest segment of
       | user growth was too fickle. They wanted all kinds of magic in
       | new optional features for their personal preferences that took
       | incredible effort. I lacked the analytics to see who used which
       | exotic features. Most of these people just wanted a beautifier
       | more than a diff tool and would drop you in a heartbeat for more
       | popular tools that wouldn't do what they wanted but were popular.
       | 
       | The tool I wrote did have a strong following mostly around markup
       | language parsing that was not at all exotic but solved problems
       | other tools refused to approach.
       | 
       | My guidance is don't become a code beautifier. In the languages I
       | was supporting, during the time frame I was supporting this, code
       | beautifiers were all the rage. Nobody seemed to want a diff tool
       | with extra capabilities. Stick to being a diff tool. The people
       | that are intentionally looking for intelligent diff tools tend to
       | be more engineering focused and make for a loyal audience. People
       | looking for code vanity are just the same as window shoppers
       | walking down a street.
        
         | Wilfred wrote:
         | Thanks, this is good advice. There are some super featureful
         | diff tools out there. For example,
         | https://github.com/dandavison/delta does a line-based diff but
         | it also syntax highlights its output.
         | 
         | I'm hoping that defining syntax in a separate TOML file will
         | let end users extend difftastic for their own languages/config
         | files. I want to keep difftastic small and manageable.
        
         | pdimitar wrote:
         | Can you give us a link to your tool?
        
           | hinoki wrote:
           | The GP's first submission to HN was a link to a diffing tool,
           | so I assume it's this: https://prettydiff.com/
        
       | runeks wrote:
       | It's a great idea, but I don't think defining the syntax of a
       | programming language as a _syntax.toml_ file will work for enough
       | programming languages for this to be useful. You're basically
       | rewriting the parser of your language in a DSL that isn't as
       | expressive as the language the parser is written in.
       | 
       | I think you'd need another parser/syntax interface for this to
       | work. E.g. running a binary that you can submit source code to
       | which responds with a JSON file containing the parsed tokens.
       | That way you can reuse the compiler's parser.
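
As a hedged illustration of the "external tokenizer that answers with JSON" interface suggested above, the sketch below uses Python's own standard `tokenize` module as the language-provided lexer. The JSON field names (`kind`, `text`, `line`, `column`) are invented for this example and are not a format that Difftastic or any other tool defines.

```python
import io
import json
import tokenize

# Layout-only tokens that a diff tool would probably not want to compare.
SKIPPED = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def tokens_as_json(source):
    """Tokenize Python source with the stdlib lexer and return a JSON array of tokens."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        out.append(
            {
                "kind": tokenize.tok_name[tok.type],
                "text": tok.string,
                "line": tok.start[0],
                "column": tok.start[1],
            }
        )
    return json.dumps(out)


print(tokens_as_json("x = 1  # answer\n"))
```

A wrapper script around this function could read source on stdin and print the JSON, which is roughly the kind of binary interface the comment has in mind.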
        
         | Wilfred wrote:
         | Yeah, so it's basically a lexer with an extremely simplistic
         | parser.
         | 
         | Compiler parsers aren't a great fit for difftastic. They
         | discard comments, they may not give you output if there are
         | syntax errors, and they're usually tied to a specific language
         | version.
         | 
         | Since this format works well for Comby (Rijnard's talk is
         | excellent: https://www.youtube.com/watch?v=JMZLBB_BFNg ) I'm
         | hopeful it's an adequate solution for Difftastic.
         | 
         | It will also allow users to add their own custom syntax/config
         | formats.
         | 
         | That said, using tree-sitter might be an option. It's more
         | forgiving than compiler parsers.
        
           | zhengyi13 wrote:
           | FWIW, as soon as I saw this project, I wondered specifically
           | about Treesitter's applicability to this problem, and I found
           | https://github.com/afnanenayet/diffsitter.
           | 
           | Maybe there's something to be learned there?
        
       | Kinrany wrote:
       | Reminds me of Comby [1]: it's language-agnostic and relies on
       | various brackets to make search-and-replace more structured.
       | 
       | Edit: ah, of course, Comby is referenced in the Readme.
       | 
       | [1]: https://comby.dev/
        
       | secondcoming wrote:
       | How can he have crashes if it's written in Rust?
        
         | dcminter wrote:
         | Probably meaning 'panic' - e.g. unwrapping a Result without
         | allowing for an error result.
        
           | Wilfred wrote:
           | Yep! There's a lot of .unwrap() and .expect() in the
           | codebase, so it panics. Since it's Rust, you get a line
           | number and an error message rather than a segfault.
           | 
           | I will tidy it up at some point, but I spend too much time
           | throwing away ideas that don't work. Defensive code is silly
           | if you delete it after!
        
       | beermonster wrote:
       | A good time to also point out
       | https://tekin.co.uk/2020/10/better-git-diff-output-for-ruby-...
        
       | blixt wrote:
       | Been hoping for more of this for years. We stare at diffs all day
       | yet we have to accommodate the computer by understanding that the
       | parenthesis it claims was changed wasn't actually changed, there
       | was just another set of parentheses added. There are of course
       | limits to how much a diff tool can extract meaning from two
       | pieces of content, but structure and perhaps even heuristics like
       | "new function was added here, maybe the curly brace belongs with
       | that and not the old function" would certainly help.
        
         | vcmiraldo wrote:
         | you'll probably enjoy the patience diff:
         | https://blog.jcoglan.com/2017/09/19/the-patience-diff-algori...
        
       | awinter-py wrote:
       | syntax-aware semantic history would be incredibly useful for code
       | review and codebase archaeology. better detection of global
       | rename, refactorings, and moves would make CR diffs way less
       | messy.
       | 
       | code review is a necessary but painful part of releasing good
       | code on a team; anything that makes it slightly easier is a force
       | multiplier for companies whose main bottleneck is software
        
       | e_proxus wrote:
       | Couldn't something like this be based on e.g. Sublime's syntax
       | definitions? Then it would work on all languages/formats that had
       | such a definition.
        
       | foreigner wrote:
       | I built a diff tool for spreadsheets years ago:
       | https://support.smartbear.com/collaborator/docs/working-with...
       | 
       | Never really worked all that well. I looked for research on how
       | to diff something like that but didn't find anything useful. IIRC
       | the diff works by "serializing" the cell grid, effectively
       | treating each cell as a separate "line" and then running that
       | through a conventional line-based diff algorithm.
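
The serialization approach recalled above might look roughly like this. It is a guess at the general shape, not the actual product code: the `R<row>C<col>=value` line format and both function names are assumptions.

```python
import difflib


def serialize(grid):
    """Flatten a cell grid into one text line per cell, in row-major order."""
    return [
        f"R{r}C{c}={value}"
        for r, row in enumerate(grid, start=1)
        for c, value in enumerate(row, start=1)
    ]


def diff_grids(old, new):
    """Run a conventional line-based diff over the serialized cells."""
    diff = difflib.unified_diff(serialize(old), serialize(new), lineterm="", n=0)
    # Drop the "---"/"+++" file headers and "@@" hunk headers, keep cell changes.
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


old = [["a", "b"], ["c", "d"]]
new = [["a", "b"], ["c", "e"]]
print(diff_grids(old, new))
```

One weakness of this particular encoding is that inserting a row shifts the address of every later cell, so all of them show up as changed.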
        
       | jd115 wrote:
       | What I really want to know is what can I use to diff XML files,
       | semantically?
        
         | vbarta wrote:
         | http://mangrove.cz/diffmark/ (full disclosure: I wrote that),
         | for example...
        
         | beermonster wrote:
         | I usually c14n them and then diff
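
A small sketch of the "canonicalize, then diff" workflow, using `xml.etree.ElementTree.canonicalize` from the Python standard library (available since Python 3.8). Splitting the canonical form at `><` to get one element per line is a convenience added for this example, not part of C14N itself.

```python
import difflib
from xml.etree.ElementTree import canonicalize


def canonical_lines(xml_text):
    """Canonicalize XML (C14N 2.0) and break it into one tag per line for diffing."""
    canon = canonicalize(xml_text, strip_text=True)
    return canon.replace("><", ">\n<").splitlines()


def xml_diff(old_xml, new_xml):
    """Line diff of two documents after canonicalization."""
    return list(
        difflib.unified_diff(
            canonical_lines(old_xml), canonical_lines(new_xml), lineterm=""
        )
    )


# Same document, different attribute order and empty-element syntax.
a = '<r b="2" a="1"><c/></r>'
b = '<r a="1" b="2"><c></c></r>'
print(xml_diff(a, b))
```

Canonicalization removes differences such as attribute order and `<c/>` versus `<c></c>`, so only changes to the actual content remain in the diff.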
        
       ___________________________________________________________________
       (page generated 2021-07-08 23:02 UTC)