[HN Gopher] How far should a programming language aware diff go?
___________________________________________________________________
How far should a programming language aware diff go?
Author : thunderbong
Score : 58 points
Date : 2024-07-18 17:18 UTC (4 days ago)
(HTM) web link (semanticdiff.com)
(TXT) w3m dump (semanticdiff.com)
| firethief wrote:
| Interesting idea. I've just tried it with a couple of languages:
|
| - TS with Vue: SFC are not really working (it's showing a style
| change as if the whole stylesheet were replaced with a mostly-
| identical stylesheet).
|
| - Rust: It doesn't seem semantic at all. It's showing a lot of
| character-level insertions and deletions that seem worse than how
| git-diff or GitHub would break down the changes.
|
| It doesn't seem ready yet for what I'd like to use it for.
| DarkPlayer wrote:
| Hi, author of SemanticDiff here.
|
| I'm sorry you didn't have a good experience testing the tool.
| If it doesn't work / makes things worse than a standard diff,
| that's definitely considered a bug. It is probably something
| specific to your code and not a general issue. It would
| therefore be great if you could open an issue [1] or support
| ticket [2], ideally with some sample code, so we can take a
| look. Thanks in advance!
|
| [1] https://github.com/Sysmagine/SemanticDiff/issues
| [2] support@semanticdiff.com
| PullJosh wrote:
| I was expecting this to refer to different ways to represent the
| same diff. (For example, you could represent a change from
| `console.log("hello")` to `console.log('hello')` as +'-" ... +'-"
| or as +'hello'-"hello")
|
| I don't have a specific example in mind, but it seems reasonable
| that different languages could benefit from different ways of
| representing the same diff.
| kmoser wrote:
| > - const foo = function(a, b) { ... }
|
| > + const foo = (a, b) => { ... }
|
| Assuming this is JS code, these differences should not be
| ignored, as an arrow function can behave differently than a
| traditional function.
| culi wrote:
| More specifically, a `function` expression gets its own `this`
| binding, determined by how it is called, whilst an arrow function
| takes `this` from the enclosing scope. Arrow functions also
| cannot use the `yield` keyword nor be used as constructors.
|
| https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
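The `this` and constructor differences described above can be checked directly. This is a minimal sketch, assuming a Node.js or browser runtime; `makeHandlers`, `outer`, and `canConstruct` are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: builds one handler of each kind inside a function
// whose own `this` is supplied by the caller.
function makeHandlers() {
  return {
    name: 'inner',
    // Function expression: `this` is the object the method is called on.
    viaFunction: function () {
      return this.name;
    },
    // Arrow function: `this` is captured from makeHandlers at creation time.
    viaArrow: () => this.name,
  };
}

const outer = { name: 'outer' };
const handlers = makeHandlers.call(outer);

console.log(handlers.viaFunction()); // inner
console.log(handlers.viaArrow());    // outer

// Hypothetical helper: reports whether `new fn()` is allowed.
function canConstruct(fn) {
  try {
    new fn();
    return true;
  } catch (err) {
    return false;
  }
}

console.log(canConstruct(function () {})); // true
console.log(canConstruct(() => {}));       // false
```

Swapping one form for the other is therefore only safe when the body never touches `this` and the function is never called with `new`.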
| egnehots wrote:
| There is also the function scope vs block scope...
|
| var x = 3 will escape the latter.
| culi wrote:
| I believe this is actually a difference between named and
| anonymous functions. The named function syntax is
|
|     function foo1() { ... }
|
| Both of the below examples are anonymous functions
|
|     const foo2 = function() { ... }
|     const foo3 = () => { ... }
| jolmg wrote:
| If you mean the following, it actually doesn't:
|
|     (function () {
|       var x = 0;
|       const foo1 = function(a, b) { var x = 2; }
|       const foo2 = (a, b) => { var x = 3; }
|       foo1(); console.log(x); // prints 0
|       foo2(); console.log(x); // prints 0
|     })()
|
| However, there is the difference in how the implicit
| semicolons are inserted:
|
|     const foo1 = function(a, b) { return a + b; }
|     (2, 3)
|     console.log(foo1) // prints 5
|
|     const foo2 = (a, b) => { return a + b; }
|     (2, 3)
|     console.log(foo2) // prints [Function: foo2]
| olliej wrote:
| haha, oh ASI - I was very confused by your example as I
| read this as if it was
|
| > const foo1 = function(a, b) { return a + b; } (2, 3)
| > console.log(foo1) // prints 5
| > const foo2 = (a, b) => { return a + b; } (2, 3)
| > console.log(foo2) // prints [Function: foo]
|
| _Even though it makes no sense for (2, 3) to be a result
| in those cases_, that was just how I ended up reading
| it, and I was exceptionally confused about how the
| printed output could possibly happen.
|
| A super nice example of how subtle differences can really
| change things though.
|
| As a side note, ASI for JS is actually super easy to
| implement and the rules are actually really simple
| (leaving aside whether the feature itself is good :D ) as
| it's just "these specific statements can have a new line
| instead of a semicolon" - so in the parser instead of
| consume(semicolon) you can just do "semicolon or newline".
| (You can check the logic in JSC in
| https://github.com/WebKit/WebKit/blob/main/Source/JavaScript...
| - just look for autoSemicolon() or autoSemi(), I can't recall
| off the top of my head.)
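The "semicolon or newline" check described above might look roughly like this. It is a sketch under stated assumptions, not JSC's actual code: the token shape `{ type, newlineBefore }` and the name `consumeStatementEnd` are hypothetical, the sketch additionally accepts `}` and end of input, and it ignores the restricted productions (such as `return` followed by a newline) that the real grammar has.

```javascript
// Hypothetical token shape: { type: string, newlineBefore: boolean }.
// Called once the parser cannot extend the current statement any further.
// Returns the position at which the next statement starts.
function consumeStatementEnd(tokens, pos) {
  const tok = tokens[pos];
  if (tok === undefined) return pos;     // end of input ends the statement
  if (tok.type === ';') return pos + 1;  // explicit semicolon is consumed
  if (tok.type === '}') return pos;      // closing brace ends it, not consumed
  if (tok.newlineBefore) return pos;     // a line break stands in for ';'
  throw new SyntaxError(`Expected ';' before '${tok.type}'`);
}
```

Because the check only runs after the expression parser has consumed everything it can, a `(` on the next line is still read as a call when the preceding expression is callable, which matches the `foo1` case earlier in the thread.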
| glhaynes wrote:
| My guess would be that quite a large portion of changes we'd
| expect at a glance to be identical aren't, especially for
| inputs that would not be expected. I'd also guess this is much
| more likely in languages in which valid code commonly produces
| undefined behavior.
|
| If the tool could show you, for example, "this change is
| functionally identical _except_ for when the sum of the two
| inputs overflows a UInt64", that'd be pretty cool.
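A concrete case of "identical except on overflow", as a hedged illustration: two ways of computing a midpoint that agree until the sum wraps around at 2^64. JavaScript has no native UInt64, so this sketch emulates one with `BigInt.asUintN`; the function names are hypothetical and the second version assumes `a <= b`.

```javascript
// Truncate a BigInt to 64 unsigned bits, emulating UInt64 wraparound.
const u64 = (x) => BigInt.asUintN(64, x);

// Version 1: the intermediate sum a + b can wrap around at 2^64.
function midpointSum(a, b) {
  return u64(a + b) / 2n;
}

// Version 2: avoids the large intermediate sum (assumes a <= b).
function midpointOffset(a, b) {
  return u64(a + u64(b - a) / 2n);
}

const big = 2n ** 63n;

console.log(midpointSum(10n, 20n), midpointOffset(10n, 20n));
console.log(midpointSum(big, big + 2n), midpointOffset(big, big + 2n));
```

For small inputs both versions return the same value; for `big` and `big + 2n` the first wraps and the second does not, which is exactly the kind of condition a diff tool would have to discover to make the claim above.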
| culi wrote:
| I think you've answered the question posed by the title here.
| That feels too far.
| kmoser wrote:
| That would be neat, although I suspect most compilers/linters
| should already be able to warn you about potential overflows.
|
| If you want to boil down what devs are looking for in a diff
| tool to one thing, it would be "which change(s) between these
| two versions of code result in a different binary (or
| AST/opcodes/bytecode, depending on the language)?" All other
| changes, while certainly sometimes useful to know about, are
| just syntactic sugar.
| tsimionescu wrote:
| Literally every time you add/subtract/multiply two
| variables there is a potential overflow. In relatively rare
| cases, the compiler might be able to prove that they can't
| overflow, but in the general case it can't, and I doubt any
| actually do.
| olliej wrote:
| Yeah I came to say that these are not semantically equivalent
| (I guess you _could_ verify equivalence if you ensured it did
| not use this or eval)
| g4zj wrote:
| I've never knowingly used a language-aware diff tool before, but
| I wouldn't mind the option. I think it would come in handy on
| occasion.
| Smaug123 wrote:
| Personally I have `git difft` aliased to `difft --display side-
| by-side`, so it's one extra character for a semantic diffing
| tool (for me, Difftastic).
| 1-more wrote:
| Because I am silly I have
|
|     dit () {
|       verb="$1"
|       shift 1
|       GIT_EXTERNAL_DIFF=difft git "$verb" --ext-diff "$@"
|     }
|
| it works for `dit d` and `dit show HEAD` but it fails on `dit
| stash show -p stash@{0}`
| g4zj wrote:
| Thanks for mentioning Difftastic. It looks very interesting!
| I'll give it a try.
| gumby wrote:
| Never even used `diff -p`? That's been in diff for many
| decades.
| Jabbles wrote:
| nit: the order of Go's imports makes no difference:
| https://go.dev/ref/spec#Program_initialization_and_execution...
| twic wrote:
| Accurately identifying whether any change is a semantic
| difference involves solving the Halting Problem, right?
| Smaug123 wrote:
| Fortunately the article already says that hiding all
| semantically identical changes is "probably going too far", so
| they can just _not_ try and solve the halting problem.
| rob74 wrote:
| I _think_ I have heard of their product before, and reading the
| blog post intrigued me, so I wanted to try it, but... VS Code
| Integration? GitHub Integration? No standalone version which you
| could actually use as a diff tool for git locally? Ok, I guess
| only having a "cloud" version makes licensing easier, and you
| can call me old fashioned, but seeing an eminently "offline" task
| such as diff being turned into "online-only" seems a bit strange
| to me.
| DarkPlayer wrote:
| The VS Code extension works offline. The diff calculation is
| performed on the host where the VS Code GUI is running (makes a
| difference in case of SSH/Docker/WSL).
| joe_the_user wrote:
| It's a very interesting question. One idea I've toyed with over
| the years is a language specifically designed to facilitate
| effective diffs.
|
| Anyway, it seems the "Level 3: semantic diff" actually could be
| divided into different levels. But "Level 4: Mostly identical"
| seems quite problematic.
| jmull wrote:
| I think the general answer is, it depends.
|
| Hopefully this tool gives a dev ready control of what kinds of
| differences to hide/show.
|
| I'm actually not convinced of the concept of semantic diff (not
| talking just about this tool specifically)... when we talk about
| code that is different but equivalent, I think we're talking
| about elements of style.
|
| It seems to me that it would pretty much always be better to
| normalize the elements of style considered insignificant, rather
| than hide them just in the diff tool. That covers diffing as well
| as viewing/reading the code.
|
| If you don't care about a particular element of style then either
| it shouldn't be coming up much or I think you'd be better off
| using some kind of enforcing/fixing linter.
| rty32 wrote:
| In theory semantic diff is useful, but based on my code review
| experience, it hardly matters. For a language like Python or
| JavaScript, a developer fluent in these languages doesn't really
| pay much attention to these things anyway, just like you don't
| normally pay much attention to commas and periods in a sentence
| unless they cause confusion. Personally I wouldn't pay $5/month
| out of pocket for this functionality.
| xg15 wrote:
| I think I'd appreciate some sort of "semantic grouping" of
| individual changes more than drawing some random line and
| classifying all changes below it as "trivial".
|
| The problem is that even a lot of the changes that normally
| constitute clutter can become relevant in certain situations or
| even introduce bugs.
|
| One example would be ordering of Python imports: Changing the
| order of imports _should_ have no effect on program behaviour if
| all your packages are well-behaved - and in 99.99% of cases it
| indeed hasn't. But the fact remains that imports are statements
| that are executed and can have side-effects. If a package does
| something nontrivial during load, changing the import order _can_
| have effects. Hiding such a change could mask introduction of a
| bug.
|
| Hiding changes can also lead to confusion if you are trying to
| understand a series of changes that are based on each other, or
| if _all_ changes of a commit are hidden. I've had the latter
| situation with IntelliJ, where the working tree was shown as
| "unclean" but the diff was completely empty. Solution: The diff
| wasn't actually empty, IntelliJ was just set to hide the changes.
|
| I think a more interesting solution would be to build a sort of
| "tree of changes": At the bottom, you'd have the individual
| changes in the file; one level up, the changes would be grouped
| into higher-level operations, such as "change formatting",
| "rename identifier", "remove field", "move function", etc. If
| possible, those could be grouped into even higher-level changes,
| such as "implement new class" or "extract expression into
| function", etc.
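One way the "tree of changes" idea could be represented, as a rough sketch only: leaves are individual line changes, and inner nodes are the higher-level operations that group them. Every name here (`leaf`, `group`, `render`, the `kind` labels, the `cart.js` sample lines) is hypothetical, and the hard part, inferring the groups from a raw diff, is not attempted.

```javascript
// Leaf node: one changed line. Inner node: a named group of changes.
function leaf(file, line, text) {
  return { kind: 'line', file, line, text, children: [] };
}
function group(kind, children) {
  return { kind, children };
}

// Hand-built sample tree; a real tool would have to infer this.
const tree = group('extract expression into function', [
  group('add function', [
    leaf('cart.js', 1, '+ function total(items) {'),
    leaf('cart.js', 2, '+   return items.reduce((sum, i) => sum + i.price, 0);'),
    leaf('cart.js', 3, '+ }'),
  ]),
  group('replace expression with call', [
    leaf('cart.js', 9, '- const t = items.reduce((sum, i) => sum + i.price, 0);'),
    leaf('cart.js', 9, '+ const t = total(items);'),
  ]),
]);

// Count the line-level changes underneath a node.
function countLines(node) {
  if (node.kind === 'line') return 1;
  return node.children.reduce((n, child) => n + countLines(child), 0);
}

// Render the tree, collapsing every group at depth maxDepth into a summary.
function render(node, maxDepth, depth = 0) {
  const indent = '  '.repeat(depth);
  if (node.kind === 'line') return [indent + node.text];
  if (depth >= maxDepth) {
    return [`${indent}${node.kind} (${countLines(node)} lines)`];
  }
  const below = node.children.flatMap((c) => render(c, maxDepth, depth + 1));
  return [indent + node.kind, ...below];
}

console.log(render(tree, 1).join('\n'));
```

Nothing is hidden in this representation: a reviewer can expand any group down to the underlying lines, and `maxDepth` only controls how much is summarised at first glance.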
| blackenedgem wrote:
| I think the problem you'll eventually run into is figuring out
| intent from the diff. It seems like an easier version of
| reverse compiling.
|
| When it comes down to semantic diffs I'm more interested in
| something like the Semantic Patch Language by Coccinelle. Being
| able to represent mundane refactorings across an entire
| codebase in a few lines seems great. And it unifies intent with
| the diff.
| golergka wrote:
| And just like that, another GPT-4 wrapper startup was born.
| tsimionescu wrote:
| Agreed, I don't think the value of a semantic diff would be in
| hiding changes. Instead, the value should be in generating more
| useful diffs.
|
| Normal diff often gets "confused" compared to how you'd
| logically identify the code. For example, if you extract a
| piece of a larger function as a smaller function, instead of
| showing that a piece of code was moved, it will show that you
| changed a header, deleted some lines, added others below, etc.
| A semantic diff should be able to refine these diffs in a
| better way, but shouldn't hide them. Even for the whitespace
| changes, I'd like it to show the diff, but the overlay to
| explain that only whitespace is different, so I know I don't
| need to look at it carefully.
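The "annotate instead of hide" idea for whitespace could be sketched as a classifier over a pair of lines. This is a deliberately naive illustration with a hypothetical function name: it strips all whitespace before comparing, so it would mislabel changes inside string literals, or changes that merge two tokens, as whitespace-only. A real tool would compare tokens or syntax trees instead.

```javascript
// Classify how a line changed, so a diff view can label it rather than hide it.
function classifyLineChange(before, after) {
  if (before === after) return 'unchanged';
  const strip = (s) => s.replace(/\s+/g, '');
  if (strip(before) === strip(after)) return 'whitespace-only';
  return 'content';
}

console.log(classifyLineChange('const x = 1;', 'const x = 1;'));   // unchanged
console.log(classifyLineChange('foo( a,b )', 'foo(a, b)'));        // whitespace-only
console.log(classifyLineChange('const x = 1;', 'const x = 2;'));   // content
```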
| whirlwin wrote:
| After switching to difftastic for semantic diff, I have never
| looked back. (https://github.com/Wilfred/difftastic)
|
| How does semanticdiff compare to that? Anyone got experience?
| DarkPlayer wrote:
| You can find a comparison of the two tools here:
| https://semanticdiff.com/blog/semanticdiff-vs-difftastic/
|
| As author of SemanticDiff, I am obviously a bit biased. But
| Wilfred, the author of difftastic, found the analysis to be
| "pretty even-handed" [1], so I think it should be somewhat
| fair.
|
| [1]: https://x.com/_wilfredh/status/1764424652611318146
| zokier wrote:
| I think this question has already largely been answered by
| automatic style (etc) tools. Such tools generally should not make
| semantic changes to programs, so they (implicitly) define what
| are meaningful semantic changes and what are meaningless changes.
| emporas wrote:
| There is also diffsitter [1]. I was testing it a month ago, and it
| works fine. Not sure what language-aware diffing exactly means, but
| diffsitter uses tree-sitter and it is comparing ASTs and CSTs of
| the files.
|
| [1] https://github.com/afnanenayet/diffsitter
| MathMonkeyMan wrote:
| I haven't actually checked the source, but I've heard that clang-
| format works by assigning "badness" weights to each choice of
| whitespace between tokens, and then runs Dijkstra's (or some
| other DP) to find the least bad set of choices. A recent Tom7
| video said that Knuth did the same thing for text justification.
|
| How about we do a similar thing for ASTs? Like a peephole
| optimizer looking for runs of instructions that could be
| substituted for simpler alternatives, a tree diff could identify
| diff patterns that "might be trivial." You have a whole catalog
| of these patterns, and assign to each a weight. Then the
| displayed diff is the optimal set of choices "consider different
| or not?"
|
| You would need some additional ingredient, though; some boundary
| condition. Otherwise "everything is the same" would always
| minimize badness.
| amelius wrote:
| https://en.wikipedia.org/wiki/Edit_distance
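For reference, the classic edit distance from the linked article is itself a small dynamic program. The sketch below is the standard Levenshtein distance with unit costs, working on strings or on arrays of tokens; it is not the weighted tree-diff proposed above, only the baseline that such a scheme would extend with per-pattern weights. The function name is an assumption.

```javascript
// Levenshtein distance between two sequences (strings or token arrays).
// Insertions, deletions, and substitutions each cost 1.
function editDistance(a, b) {
  // prev[j] holds the distance between the first i-1 items of a
  // and the first j items of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // keep or substitute
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(editDistance('kitten', 'sitting')); // 3
console.log(editDistance(['a', 'b', 'c'], ['a', 'x', 'c'])); // 1
```

Here the boundary condition comes for free: declaring two different items "the same" is not an available move, so the minimum cannot collapse to zero unless the inputs really match.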
| ckdot2 wrote:
| Not far. Just show all changes. Like the blog article already
| states, for many projects you already have code formatters, so
| changes in format usually don't happen a lot - and if they do
| there might be a reason you don't want to hide (like... you
| change your rules of code formatting). For the other examples, I
| don't see why you would want to hide them either. If you
| don't want to see commas added in a list, make it a rule that the
| comma always has to be appended after the last element. Most
| languages allow that. Semantic equivalence? The JS example isn't
| even equivalent because "this" may have a different context. I'd
| prefer to have a "dumb" diff that simply shows all the changes
| instead of adding these kinds of complexities. Just keep your MRs
| small and there's no real issue.
| philipwhiuk wrote:
| I think a slider on the code review would be nice.
|
| That way I could start at the 'definitely a change' stuff and
| then slide down towards L2 until I decided it was fine.
| diffxx wrote:
|     - def foo(): int | None
|     + def foo(): None | int
|
| Whether or not this makes a semantic difference is language
| implementation dependent. I think that is why this kind of tool
| is not especially appealing to me. I would have to have almost
| complete knowledge of the compiler and the diff tool to truly
| trust that there is no semantic difference. Moreover, I would
| like to know why textual changes that have no semantic effect
| are being mixed with those that do.
|
| For me, text is king and that is the level at which I want to
| evaluate diffs 99% of the time, but I do recognize that others
| have different goals and preferences.
___________________________________________________________________
(page generated 2024-07-22 23:08 UTC)