[HN Gopher] Difftastic: Syntax-aware structured diff tool
___________________________________________________________________
Difftastic: Syntax-aware structured diff tool
Author : ingve
Score : 275 points
Date : 2021-07-08 06:03 UTC (16 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| SeriousM wrote:
| That's awesome!
|
| Much better than "guessed" syntax diff or even line diff.
| pedrovhb wrote:
| This is cool. I work with a large Python codebase that's been
| around for a while (hasn't been that long since it's been
| completely off Python 2). It's not a bad codebase, but naturally
| it has almost no typing annotations, and other devs are just
| starting to come around to the idea.
|
| I love `mypy --strict` but it produces way too much noise with
| all the other code, so I've been using a tool I made that runs
| mypy over the changed files and then greps the output to filter
| for lines that `git diff` points out have changed. It's pretty
| rough and imperfect (doesn't catch errors that started appearing
| in unchanged files, or unchanged lines it the same file, for
| instance) but it's still quite helpful. I've been meaning to make
| an improved version that runs mypy over the branch point, then
| over my branch, and then maps new and changed lines between them
| so it displays all errors that are new, but I haven't gotten
| around to it yet. It'd be useful for other tools too, like
| semgrep.
| NateEag wrote:
| I'd forgotten about this, but years ago I was working in a huge
| PHP nightmare codebase.
|
| So, I hacked together a pre-commit hook that blocked the commit
| only if the configured style checker registered errors on lines
| being added by the diff.
|
| It never got very polished, but I wound up using it in two
| codebases over the years.
|
| https://github.com/NateEag/diff-check
| zomglings wrote:
| This sounds well in line with what I have been building - a
| tool to take syntax-aware diffs across git commits:
| https://github.com/bugout-dev/locust
|
| It currently supports Python, Javascript, and Java. I like the
| idea of mypy changes, as well.
| pedrovhb wrote:
| I'll have a look, thanks!
| philipov wrote:
| This seems like a really difficult issue to solve in the general
| case, but I found that solving it for the specific case I had was
| a tractable problem.
|
| I have a need to diff the output of a query and compare it to the
| last time it ran, to do regression testing. Just diffing the
| resulting CSVs wasn't very useful, because I needed the ability
| to do things like ignore new columns, and report the exact column
| that had differences from the previous version.
|
| I was able to do that by defining a primary key on which I could
| outer join the two tables. Missing or new rows would be the ones
| that didn't join, and then I could do a per-column comparison for
| each row that did join.
| Wilfred wrote:
| Author here! Happy to answer any questions :)
| Audiophilip wrote:
| My favorite diff tool is the one shipped with Plastic SCM, Xdiff.
| Since it's visual, it makes it very easy to see what changes have
| been done to the file.
|
| https://www.plasticscm.com/features/xmerge
| omgtehlion wrote:
| And theirs Git UI, gmaster, includes the same diff tech too. I
| do not use it for everyday tasks, but when I need to understand
| complex and/or big changes this is my go-to tool.
| dan-robertson wrote:
| I personally am much more excited by "sliders" than the
| structure-aware diffs. Marking additions between [], it is the
| difference between e.g. handle_case [some new
| case over multiple lines handle_case] some existing case
| [ check_invariant(); } function newFunc(){
| ... ] check_invariant(); }
|
| And [handle_case some new case]
| handle_case old case [function newFunc(){ ...
| check_invariant(); }]
| Wilfred wrote:
| I agree sliders are a problem, and I hope to have a solution
| there.
|
| Syntactic differs already do better because they understand
| that parentheses/brackets are paired. Difftastic does OK with
| this example: https://imgur.com/a/pVlVBo5
| dan-robertson wrote:
| Yeah I'm keen to see your solution.
|
| FWIW, the formatting of the snippets I wrote above was as two
| separate diffs for additions with the new additions (ie green
| parts) represented with [square brackets].
| mookid11 wrote:
| I wrote diffr [0] for that purpose; it serves me well, especially
| if your team makes code with long lines.
|
| In my opinion, a simple approach that does NOT make any parsing
| is more efficient (what about bugs in your parser? code with
| syntax errors? also, how fast would the parser be?)
|
| [0]: https://github.com/mookid/diffr
| feanaro wrote:
| Many of your concerns could be alleviated by using Tree-Sitter.
| (https://tree-sitter.github.io/tree-sitter/)
| pfdietz wrote:
| Tree-sitter is great, but I find it could do a better job
| with broken code. This is particularly important when parsing
| things like C or C++ where the preprocessor makes it likely
| that unpreprocessed code can't be parsed anyway.
| tlamponi wrote:
| Surely far from being as elaborate as the linked tool, but I use
| the following git command a few dozen times daily:
| git diff --word-diff=color --word-diff-regex='\w+'
|
| I added two aliases to my .gitconfig, one for diff and one for
| show: [alias] word-show = show --word-
| diff=color --word-diff-regex='\\w+' word-diff = diff
| --word-diff=color --word-diff-regex='\\w+'
|
| Those small things improved development and reviewing a lot for
| me!
|
| If stuff moved around or got it's indentation changed I either
| add `--color-moved` and/or `-w` (ignore whitespace changes) flags
| to filter out extra noise.
|
| Sometimes I need to use another regex though, e.g. a simple dot
| `.` for match all with no greedy +
| mikepurvis wrote:
| I'd feel so much more motivation for checking out alternative
| diff tools if there was a better story for integrating them with
| the review tools in Github, GitLab, etc. I know there's nothing
| anyone can do about that-- it's something the Git hosts
| themselves have to enable, or I have to see enough benefit in it
| to go to an dedicated review tool to make the bother of that
| worthwhile.
|
| I believe Gerrit has a pluggable diff-- is there anything more
| broadly on improving this story?
| Wilfred wrote:
| Definitely!
|
| I still look at diffs in the terminal pretty often, but all my
| code reviews are in rendered HTML.
|
| That said, there needs to be a credible tool before review
| tools can adopt it! GitHub does a line-based diff with word-
| based highlighting, which is probably the best you can do
| without syntactic smarts.
| mikepurvis wrote:
| It would be neat if there was a way to supply a "diff hint"
| or something right in your git commit metadata. Obviously the
| receiver/reviewer/renderer can ultimately do whatever they
| want, but it would helpful if I as the one preparing the
| change could at least specify intent.
|
| I guess projects like the kernel where the review system is
| built around emailed patches kind of already get this for
| free-- once committed, the change will be rendered according
| to the local user's git settings, but during the review
| itself, it will be a diff prepared by the change's author
| that will be under discussion.
|
| In a glorious future where GitLab has four different diff
| options, it would be great if I could specify that I want it
| to default to the hinted diff tool, falling back to my
| preferred one if there is no hint.
| bifftastic wrote:
| I like the name
| davidkunz wrote:
| To ease the pain in conventional differs, we use a pre-commit
| hook to format the source code (prettier). This way we only see
| differences if something _actually_ changed.
| conceptme wrote:
| This worsens the problem especially in templates when the
| nesting changes.
| frafra wrote:
| Good diff tools will only show you that the indentation
| changed, not the line as a whole (Meld for example).
| danuker wrote:
| Prettier works for JS.
|
| A similar Python tool is Black: https://github.com/psf/black
|
| > Black makes code review faster by producing the smallest
| diffs possible.
| Wilfred wrote:
| A syntactic differ like Difftastic is very helpful when your
| codebase is autoformatted. Formatters often reflow code.
|
| Given the code: foo(one, two, three);
|
| If you add an argument and reformat: foo(
| one, two, new, three );
|
| A line-based diff can make it hard to spot what's changed.
| maw wrote:
| When I was still formatting code manually (... -ish; emacs
| did a lot of the tedious work for me) I eventually settled on
| a style very similar to your four-argument example, precisely
| because it makes diffs easier to read.
|
| For the same reason, I asciibetically sorted things when it
| made sense.
|
| Now I use prettier and black and I'm mostly satisfied by
| them, but their reflow behavior puts the lie to "[black]
| makes code review faster by producing the smallest diffs
| possible."
| Cthulhu_ wrote:
| I think code formatting should be mandatory and one of the
| first things you adopt in your project. Resist code style rule
| changes as much as possible, and if you do, apply them across
| the whole codebase in one go to avoid churn and noise in diffs
| down the line.
|
| And if you do make style changes, put them in a separate commit
| at the very least so the diffs are cleaner and code reviews are
| easier.
|
| In my project I use gofmt (goimports) for back-end code and
| prettier for front-end; I've configured my editor to apply
| those on save, and a pre-commit hook to either run the
| formatter, or error if the formatting is not according to the
| spec.
|
| One of Go's proverbs is "Gofmt's style is no one's favorite,
| yet gofmt is everyone's favorite.". Consistency and low noise
| is more important (in that case) than a specific code style
| preference.
| zamalek wrote:
| > gofmt
|
| gofmt is in a small set of formatters that disallows
| configuration and choice, it's also the first to my
| recollection. This is a feature because deciding on a coding
| standard is bike shedding. It's also extremely aggressive,
| and undoes pretty much any choice you may make in formatting
| your code; a feature, again.
|
| Not all languages are this fortunate. Some have configuration
| (cargo fmt), others aren't aggressive enough (Roslyn).
| llimllib wrote:
| you might want to check out gofumpt too, if you haven't
| already: https://github.com/mvdan/gofumpt
| TotempaaltJ wrote:
| And if you do decide to do single-commit massive style
| changes, add the commits to an ignore revs file: http://git-
| scm.com/docs/git-config#Documentation/git-config....
| lpapez wrote:
| Did not know that thing existed, super useful. Thanks a
| ton.
| vcmiraldo wrote:
| That is a really difficult problem for more reasons than what
| fits in this comment :) In fact, I got my PhD studying this very
| problem (https://victorcmiraldo.github.io/data/MiraldoPhD.pdf).
|
| I did not find any description of how your diffing algorithm
| works nor how you represent a patch. I'd be really curious to
| know more.
| Wilfred wrote:
| Wow, thank you for the pointer! I've added it to
| https://github.com/Wilfred/difftastic/wiki/Structural-Diffs as
| I'm trying to understand the other solutions in this space.
|
| Difftastic does not create a patch or worry about merging.
| That's a hard problem that I'm not trying to solve. Instead, it
| builds two ASTs, then marks each node as unchanged or novel.
|
| To compute the diff, I use a graph search. Each vertex
| represents a position in both the left and right ASTs.
|
| Suppose you're comparing A with X A.
|
| Start node: Left: A Right: X A ^
| ^
|
| The possible next nodes are:
|
| (1) Treat A on the left as novel. Left: A
| Right: X A ^ ^
|
| (2) Treat X on the right as novel. Left: A
| Right: X A ^ ^
|
| Both (1) and (2) are the same 'distance', but (2) is closer to
| the end node, because there's a edge from (2) to the end that
| marks A as unchanged.
|
| I've implemented this using Dijkstra's algorithm. My graph is
| directed and acyclic, so there are faster algorithms like
| topological sort. However, I don't construct the whole graph in
| advance (that would take O(N^2) memory) so instead I construct
| the graph nodes as necessary.
|
| (This is very similar to Autochrome, which I've linked in the
| README. Autochrome has a worked example which is really
| helpful.)
|
| At some point I think I'll have to use A* search instead. If
| there are more than 500 lines of code with lots of changes,
| difftastic can take a few seconds to terminate due to the naive
| graph search.
| dan-robertson wrote:
| I think the diffing is the "obvious" graph search algorithm
| between trees, where a "tree" is a list of atoms or trees
| (think lisp lists).
|
| Basically to diff a tree of n top-level elements against one of
| m elements, construct a graph where nodes lie on an (n+1)x(m+1)
| grid. Each node (a,b) corresponds to having looked at a
| elements of the first and matched them to b elements of the
| second list. Add edges (a,b)->(a+1,b) for deletion;
| (a,b)->(a,b+1) for insertion; and (a,b)->(a+1,b+1) for an inner
| diff (ie basically this graph search problem again). Choose
| weights to apply to node and now find the shortest path from
| (0,0) to (n,m).
| vcmiraldo wrote:
| From you description it seems like we're just computing the
| standard insert-delete tree-edit-distance. These tend to be
| slow.
|
| This implies that the patch language only supports insertion,
| deletion and modification of nodes, which is a shame since
| refactorings, moves and duplications are also common
| operations in the source-code domain. Additionally, if the
| patch language only supports insertion, deletion and
| modification, the merging algorithm will perform poorly.
| Wilfred wrote:
| Yep, that's a fair description. Note that I'm not providing
| a merge algorithm, just a pretty way of viewing changes.
|
| I did look at modelling moves in an earlier prototype, but
| it's incredibly hard to display the result in a coherent
| way when there are also insertions. It was also easier to
| drop it when I moved to Dijkstra.
|
| As you can see in the screenshot in the readme, it does
| support inserting tree nodes whilst preserving children,
| which covers a ton of cases.
| zokier wrote:
| Gumtree diff is based on ast:
| https://github.com/GumTreeDiff/gumtree
|
| There is some discussion of using Treesitter for parsing, that
| would potentially open door for many languages:
| https://github.com/GumTreeDiff/gumtree/issues/148
| austincheney wrote:
| I worked on this problem for just over a decade before moving
| onto other things. It's a tough problem to solve.
|
| The biggest problem I ran into is that the largest segment of
| user growth were too fickle. They wanted all kinds of magic in
| new optional features for their personal preferences that took
| incredible effort. I lacked the analytics to see who used which
| exotic features. Most of these people just wanted a beautifier
| more that a diff tool and would drop you in a heartbeat for more
| popular tools that wouldn't do what they wanted but were popular.
|
| The tool I wrote did have a strong following mostly around markup
| language parsing that was not at all exotic but solved problems
| other tools refused to approach.
|
| My guidance is don't become a code beautifier. In the languages I
| was supporting during the time frame I was supporting this code
| beautifiers were all the rage. Nobody seemed to want a diff tool
| with extra capabilities. Stick to being a diff tool. The people
| that are intentionally looking for intelligent diff tools tend to
| be more engineering focused and make for a loyal audience. People
| looking for code vanity are just the same as window shoppers
| walking down a street.
| Wilfred wrote:
| Thanks, this is good advice. There are some super featureful
| diff tools out there. For example,
| https://github.com/dandavison/delta does a line-based diff but
| it also syntax highlights its output.
|
| I'm hoping that defining syntax in a separate TOML file will
| let end users extend difftastic for their own languages/config
| files. I want to keep difftastic small and manageable.
| pdimitar wrote:
| Can you give us a link to your tool?
| hinoki wrote:
| The GP's first submission to HN was a link to a diffing tool,
| so I assume it's this: https://prettydiff.com/
| runeks wrote:
| It's a great idea, but I don't think defining the syntax of a
| programming language as a _syntax.toml_ file will work for enough
| programming languages for this to be useful. You 're basically
| rewriting the parser of your language in a DSL that isn't as
| expressive as the language the parser is written in.
|
| I think you'd need another parser/syntax interface for this to
| work. E.g. running a binary that you can submit source code to
| which responds with a JSON file containing the parsed tokens.
| That way you can reuse the compiler's parser.
| Wilfred wrote:
| Yeah, so it's basically a lexer with an extremely simplistic
| parser.
|
| Compiler parsers aren't a great fit for difftastic. They
| discard comments, they may not give you output if there are
| syntax errors, and they're usually tied to a specific language
| version.
|
| Since this format works well for Comby (Rijnard's talk is
| excellent: https://www.youtube.com/watch?v=JMZLBB_BFNg ) I'm
| hopeful it's an adequate solution for Difftastic.
|
| It will also users to add their own custom syntax/config
| formats.
|
| That said, using tree-sitter might be an option. It's more
| forgiving than compiler parsers.
| zhengyi13 wrote:
| FWIW, as soon as I saw this project, I wondered specifically
| about Treesitter's applicability to this problem, and I found
| https://github.com/afnanenayet/diffsitter.
|
| Maybe there's something to be learned there?
| Kinrany wrote:
| Reminds me of Comby [1]: it's language-agnostic and relies on
| various brackets to make search-and-replace more structured.
|
| Edit: ah, of course, Comby is referenced in the Readme.
|
| [1]: https://comby.dev/
| secondcoming wrote:
| How can he have crashes if it's written in Rust?
| dcminter wrote:
| Probably meaning 'panic' - e.g. unwrapping a Result without
| allowing for an error result.
| Wilfred wrote:
| Yep! There's a lot of .unwrap() and .expect() in the
| codebase, so it panics. Since it's Rust, you get a line
| number and an error message rather than a segfault.
|
| I will tidy it up at some point, but I spend too much time
| throwing away ideas that don't work. Defensive code is silly
| if you delete it after!
| beermonster wrote:
| A good time to also point out https://tekin.co.uk/2020/10/better-
| git-diff-output-for-ruby-...
| blixt wrote:
| Been hoping for more of this for years. We stare at diffs all day
| yet we have to accommodate the computer by understanding that the
| parenthesis it claims was changed wasn't actually changed, there
| was just another set of parentheses added. There's of course
| limits to how much a diff tool can extract meaning from two
| pieces of content, but structure and perhaps even heuristics like
| "new function was added here, maybe the curly brace belongs with
| that and not the old function" would certainly help.
| vcmiraldo wrote:
| you'll probably enjoy the patience diff:
| https://blog.jcoglan.com/2017/09/19/the-patience-diff-algori...
| awinter-py wrote:
| syntax-aware semantic history would be incredibly useful for code
| review and codebase archaeology. better detection of global
| rename, refactorings, and moves would make CR diffs way less
| messy.
|
| code review is a necessary but painful part of releasing good
| code on a team; anything that makes it slightly easier is a force
| multiplier for companies whose main bottleneck is software
| e_proxus wrote:
| Couldn't something like this be based on e.g. Sublime's syntax
| definitions? Then it would work on all languages/formats that had
| such a definition.
| foreigner wrote:
| I built a diff tool for spreadsheets years ago:
| https://support.smartbear.com/collaborator/docs/working-with...
|
| Never really worked all that well. I looked for research on how
| to diff something like that but didn't find anything useful. IIRC
| the diff works by "serializing" the cell grid, effectively
| treating each cell as a separate "line" and then running that
| through a conventional line-based diff algorithm.
| jd115 wrote:
| What I really want to know is what can I use to diff XML files,
| semantically?
| vbarta wrote:
| http://mangrove.cz/diffmark/ (full disclosure: I wrote that),
| for example...
| beermonster wrote:
| I usually c14n them and then diff
___________________________________________________________________
(page generated 2021-07-08 23:02 UTC)