[HN Gopher] Diff Algorithms
___________________________________________________________________
Diff Algorithms
Author : znkr
Score : 70 points
Date : 2025-09-30 20:09 UTC (2 hours ago)
(HTM) web link (flo.znkr.io)
(TXT) w3m dump (flo.znkr.io)
| yboris wrote:
| Mildly related: my favorite tool for viewing .git diffs
| _diff2html_ - a CLI that with one command opens the diff in your
| browser
|
| https://diff2html.xyz/ -- https://github.com/rtfpessoa/diff2html
| ashu1461 wrote:
| Apart from source code versioning what are the other most
| important real world use cases of diff algorithms ?
| runningmike wrote:
| - Backup and restore
|
| - integrity checks from security perspective
|
| - nlp, finding same tokens in text
|
| Etc
| susam wrote:
| I encountered one about 17 years ago. It was for diffing IP
| packets, TCP segments, and network event payloads. At the time
| I worked at RSA Security on a network forensics and security
| analytics product, written in a mix of C and C++. In one of the
| projects I worked on, we needed to let users diff the packets,
| segments, and payloads. Back then we were very conservative
| about adding third-party libraries to the product. I have
| written more about that culture here:
| https://news.ycombinator.com/item?id=39951673
|
| Long story short, due to the conservative culture, most data
| structures and algorithms were implemented in house. The diff
| algorithm for packets/segments/payloads was written in house
| too and I was the one to write it.
|
| My implementation was based on a straightforward dynamic
| programming solution to the longest common subsequence problem.
| If I recall correctly, it ran in O(mn) time and O(min(m, n))
| space in the worst case, where m and n are the lengths of the
| two sequences. I knew there were more efficient algorithms, but
| this code was not performance critical. I chose to keep the
| implementation simple so anyone could understand it, learn it
| quickly, and fix bugs if they arose. It served us well for the
| next seven years until the product was replaced with a new one.
|
| On a related note, I sometimes miss that older style of
| software development where we would dive deep into a problem
| domain, master it, and design solutions ourselves. I am not
| being naively nostalgic though. I am very well aware that
| modern development, with its reliance on well established
| libraries, _usually_ delivers much greater productivity and
| reliability. Still, I think the slower and more deliberate
| approach of building things from the ground up had a certain
| charm.
| o11c wrote:
| There are at least 3 fundamentally different kinds of diff:
|
| * Single-dimensional. Diffs of text _lines_ are just this.
|
| * Multi-dimensional. Diffs of words or characters are usually
| going to be this since lines still matter, but there are multiple
| approaches (line-first? weighted tokens?).
|
| * Tree-based. Unfortunately, these are woefully scarce and poorly
| documented.
|
| For text diffs, it's nontrivial to get the "missing newline at
| end of file" logic working.
|
| For tree diffs, consider that for HTML something like
| `<p>x</p><p>y</p>` should be unmergeable, whereas
| `<b>x</b><b>y</b>` should be mergeable.
|
| (Aside: the blind promotion of `<em>` and `<strong>` did great
| harm to the notion of semantic HTML. Most things people use
| italics for (book titles, thoughts, foreign words) are explicitly
| things that `<em>` should _not_ be used for.)
___________________________________________________________________
(page generated 2025-09-30 23:00 UTC)