[HN Gopher] CLI tool, written in Rust, to diff directory snapshots
       ___________________________________________________________________
        
       CLI tool, written in Rust, to diff directory snapshots
        
       Author : jotaen
       Score  : 32 points
       Date   : 2024-01-22 19:14 UTC (3 hours ago)
        
 (HTM) web link (www.jotaen.net)
 (TXT) w3m dump (www.jotaen.net)
        
       | hartator wrote:
       | Isn't this just Git?
        
         | quacker wrote:
         | I was surprised to not see a mention of `git diff` anywhere in
         | the post.
         | 
         | Out of curiosity, I set up two directories where the second has
         | a new file, a modified file, and a removed file compared to the
         | first:
         | 
         |     $ tree dir1/ dir2/
         |     dir1/
         |     +-- empty.txt
         |     +-- modified.txt
         |     +-- remove-this.txt
         |     +-- unchanged.txt
         |     dir2/
         |     +-- empty.txt
         |     +-- modified.txt
         |     +-- new-file.txt
         |     +-- unchanged.txt
         | 
         | Then `git diff --name-status dir1 dir2` outputs the following,
         | showing the changes by file name.
         | 
         |     $ git diff --name-status dir1 dir2
         |     M       dir1/modified.txt
         |     A       dir2/new-file.txt
         |     D       dir1/remove-this.txt
         | 
         | This doesn't require running `git init` on the two directories
         | either, so it's immediately usable out of the box.
        
         | mylittlebrain wrote:
         | That was my first thought. A content-addressable file system*
         | like Git or Mercurial would be the simplest thing that could
         | work. * Not sure these VCSs could be called a CAS.
        
         | jotaen wrote:
         | I suppose you could make a similar thing happen based on git,
         | or maybe even a combination of `find` and `diff`.
         | 
         | It's certainly an interesting debate whether to use a general-
         | purpose approach vs creating a dedicated and fully customisable
         | implementation. In this case, I was interested in the exact
         | numbers and output structure that snapdiff produces. I'm not
         | sure what it would take to make the same output happen by using
         | git, or how well that would work on directory sizes in (or
         | beyond) the 100,000 files / 100 GB region.
         | 
         | If someone would be up for trying that out and sharing their
         | insights, I'd be interested to learn about it.
        
       | codetrotter wrote:
       | OP should have a look at ZFS. With large amounts of data I feel
       | that ZFS snapshots might be far more time efficient to compare
       | than diffing full directories.
       | 
       | Bonus: FreeBSD is currently considering adding Rust to their base
       | system. They have ZFS natively in FreeBSD already. Perhaps OP
       | will find joy in FreeBSD :D
        
         | timetraveller26 wrote:
         | Yes, zfs rules. If you are doing this regularly you should
         | consider a fs like zfs or use another solution like Borg (a
         | deduplicating backup tool).
         | 
         | Though this has the convenience of being a more universal
         | solution.
        
         | jotaen wrote:
         | I haven't looked into ZFS yet, but thanks to your comment, it's
         | now on my todo list.
         | 
         | One idea behind my implementation was to have something that's
         | more agnostic of specific file systems. But I guess that's an
         | aspect that may be worth reconsidering.
        
       | gumby wrote:
       | Why does it matter if it is written in Rust or assembly code?
        
         | omaranto wrote:
         | I think if you do write something in Rust it is customary to
         | mention it to avoid getting tons of suggestions to rewrite it
         | in Rust.
        
         | jacquesm wrote:
         | Because 'in Rust' is good for at least 30 upvotes.
        
         | timetraveller26 wrote:
         | They are special prompts for the HN LLM
        
         | throwaway8582 wrote:
         | Aside from arguments about performance and memory safety, I'm
         | generally more likely to try something written in Rust (or Go)
         | because projects in those languages tend to be easy to build
         | or download as a static binary. For Rust projects, `cargo
         | install <name>` generally works. On the other hand, when I see
         | something written in C++ or Python, it's an indicator that
         | there may be significantly more work involved.
        
         | jotaen wrote:
         | I get the suspicious sentiment, but I mentioned it for other
         | reasons in the title. Apart from solving a personal need, this
         | project is largely about tinkering with Rust and performance
         | optimisations. I was hoping that mentioning the language
         | prominently would help attract people that may give valuable
         | feedback regarding those things.
        
       | rpigab wrote:
       | This looks nice! I like TUI programs for this purpose.
       | 
       | Previously, I've used Beyond Compare 4 by Scooter Software (GUI,
       | free to try). It's nice to have more options because `diff -r`
       | doesn't get you very far.
       | 
       | I also like the ability to find duplicate images or any files
       | regardless of location with Czkawka (GitHub qarmin/czkawka).
        
       | grow2grow wrote:
       | > (For some extra "entropy", by the way, snapdiff also takes the
       | file size into consideration when comparing files.)
       | 
       | I get what you mean by "entropy", but wouldn't it be more direct
       | to just say matching hashes will have their file sizes compared
       | as a remedy to the collision?
       | 
       | Thanks for the effort, the "back yard diy" utilities are great
       | for learning, especially with an accompanying article.
        
         | jotaen wrote:
         | Yeah, fair point - I couldn't think of a better word, that's
         | why I wrapped it in quotes. I've tried to simplify the
         | phrasing:
         | 
         | > For some extra safety margin to avoid collisions, by the way,
         | `snapdiff` also takes the file size into consideration when
         | comparing files.
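
A minimal sketch of the idea in the revised sentence above, assuming nothing about snapdiff's internals: a file is identified by the pair (size, content hash), so two files count as identical only when both parts match. The function names and the use of SHA-256 are illustrative assumptions, not taken from the tool.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def same_content(path_a, path_b):
    """Compare sizes first (cheap); hash only when the sizes agree."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return fingerprint(path_a) == fingerprint(path_b)
```

Checking the size first also acts as a shortcut: files of different length are rejected without reading their contents at all.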
        
       | mustache_kimono wrote:
       | Curious -- how does your system handle hardlinks? I have a ZFS
       | system which does a ZFS "roll forward":
       | 
       |     --roll-forward=<ROLL_FORWARD>
       |         traditionally 'zfs rollback' is a destructive operation,
       |         whereas httm roll-forward is non-destructive.  httm will
       |         copy only files and their attributes that have changed
       |         since a specified snapshot, from that snapshot, to its
       |         live dataset...
       | 
       | One of the more difficult problems I had to deal with was hardlink
       | resolution. Basically I still have to scan the whole dataset, and
       | create a map of hardlinks before any run.
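
One way such a hardlink map could be built is sketched below. This is an assumption-laden illustration, not httm's code: it walks the tree once and groups regular files by their (device, inode) pair, which is what hardlinked paths have in common. The function name `hardlink_map` is hypothetical.

```python
import os
import stat
from collections import defaultdict


def hardlink_map(root):
    """Map (device, inode) -> sorted paths under `root` sharing that inode."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so symlinks are not followed
            # Only regular files with a link count above 1 can be hardlinked.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes reachable through more than one path inside `root`.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

The full walk is the cost mentioned in the comment: a link count above 1 says that other names exist, but not where they are, so every directory entry has to be visited.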
        
         | jotaen wrote:
         | Oh thanks, that's a good point. I actually haven't considered
         | hardlinks at all yet.
         | 
         | I should probably look into that at some point -
         | https://github.com/jotaen/snapdiff/issues/2
        
       | sys_64738 wrote:
       | What does this do that the plain old `diff dir1 dir2` command
       | doesn't?
        
       | rhettbull wrote:
       | I've created a similar tool in Python that I find quite useful:
       | https://github.com/RhetTbull/dirsnapshot
       | 
       | This doesn't compute content hashes as your tool does, but it is
       | designed to highlight files added, removed, or changed, and I use
       | it primarily for reverse engineering projects. One key feature is
       | that it can store the snapshot in a SQLite database that doesn't
       | take much space; you can then compare the directory against the
       | stored snapshot at some point in the future. Diffs are computed
       | based on stat() info: mode, ownership, size, mtime. In addition
       | to a CLI, it also provides a Python API so you can use it directly
       | in your own code.
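
The stat()-based approach described above can be sketched roughly as follows. This is not dirsnapshot's API: the function names, the table layout, and the choice of nanosecond mtimes are all assumptions made for illustration.

```python
import os
import sqlite3


def _scan(root):
    """Yield (relative_path, mode, uid, gid, size, mtime_ns) for each file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            yield (os.path.relpath(full, root), st.st_mode, st.st_uid,
                   st.st_gid, st.st_size, st.st_mtime_ns)


def take_snapshot(root, db_path):
    """Store the stat() info of every file under `root` in a SQLite file."""
    con = sqlite3.connect(db_path)
    con.execute("DROP TABLE IF EXISTS files")
    con.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, mode INTEGER,"
        " uid INTEGER, gid INTEGER, size INTEGER, mtime_ns INTEGER)"
    )
    con.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)", _scan(root))
    con.commit()
    con.close()


def compare(root, db_path):
    """Return (added, removed, changed) paths relative to the snapshot."""
    con = sqlite3.connect(db_path)
    old = {row[0]: row[1:] for row in con.execute("SELECT * FROM files")}
    con.close()
    new = {row[0]: row[1:] for row in _scan(root)}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in new.keys() & old.keys() if new[p] != old[p])
    return added, removed, changed
```

Because no file contents are read, this is fast, but an edit that preserves size, mode, ownership, and mtime goes unnoticed. The database file should live outside `root`, otherwise it shows up in its own diff.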
        
       ___________________________________________________________________
       (page generated 2024-01-22 23:01 UTC)