[HN Gopher] Diff Algorithms
       ___________________________________________________________________
        
       Diff Algorithms
        
       Author : znkr
       Score  : 70 points
       Date   : 2025-09-30 20:09 UTC (2 hours ago)
        
 (HTM) web link (flo.znkr.io)
 (TXT) w3m dump (flo.znkr.io)
        
       | yboris wrote:
       | Mildly related: my favorite tool for viewing .git diffs
       | _diff2html_ - a CLI that with one command opens the diff in your
       | browser
       | 
       | https://diff2html.xyz/ -- https://github.com/rtfpessoa/diff2html
        
       | ashu1461 wrote:
       | Apart from source code versioning what are the other most
       | important real world use cases of diff algorithms ?
        
         | runningmike wrote:
         | - Backup and restore
         | 
         | - integrity checks from security perspective
         | 
         | - nlp, finding same tokens in text
         | 
         | Etc
        
         | susam wrote:
         | I encountered one about 17 years ago. It was for diffing IP
         | packets, TCP segments, and network event payloads. At the time
         | I worked at RSA Security on a network forensics and security
         | analytics product, written in a mix of C and C++. In one of the
         | projects I worked on, we needed to let users diff the packets,
         | segments, and payloads. Back then we were very conservative
         | about adding third-party libraries to the product. I have
         | written more about that culture here:
         | https://news.ycombinator.com/item?id=39951673
         | 
         | Long story short, due to the conservative culture, most data
         | structures and algorithms were implemented in house. The diff
         | algorithm for packets/segments/payloads was written in house
         | too and I was the one to write it.
         | 
         | My implementation was based on a straightforward dynamic
         | programming solution to the longest common subsequence problem.
         | If I recall correctly, it ran in O(mn) time and O(min(m, n))
         | space in the worst case, where m and n are the lengths of the
         | two sequences. I knew there were more efficient algorithms, but
         | this code was not performance critical. I chose to keep the
         | implementation simple so anyone could understand it, learn it
         | quickly, and fix bugs if they arose. It served us well for the
         | next seven years until the product was replaced with a new one.
         | 
         | On a related note, I sometimes miss that older style of
         | software development where we would dive deep into a problem
         | domain, master it, and design solutions ourselves. I am not
         | being naively nostalgic though. I am very well aware that
         | modern development, with its reliance on well established
         | libraries, _usually_ delivers much greater productivity and
         | reliability. Still, I think the slower and more deliberate
         | approach of building things from the ground up had a certain
         | charm.
        
       | o11c wrote:
       | There are at least 3 fundamentally different kinds of diff:
       | 
       | * Single-dimensional. Diffs of text _lines_ are just this.
       | 
       | * Multi-dimensional. Diffs of words or characters are usually
       | going to be this since lines still matter, but there are multiple
       | approaches (line-first? weighted tokens?).
       | 
       | * Tree-based. Unfortunately, these are woefully scarce and poorly
       | documented.
       | 
       | For text diffs, it's nontrivial to get the "missing newline at
       | end of file" logic working.
       | 
       | For tree diffs, consider that for HTML something like
       | `<p>x</p><p>y</p>` should be unmergeable, whereas
       | `<b>x</b><b>y</b>` should be mergeable.
       | 
       | (Aside: the blind promotion of `<em>` and `<strong>` did great
       | harm to the notion of semantic HTML. Most things people use
       | italics for (book titles, thoughts, foreign words) are explicitly
       | things that `<em>` should _not_ be used for.)
        
       ___________________________________________________________________
       (page generated 2025-09-30 23:00 UTC)