[HN Gopher] CLI tool, written in Rust, to diff directory snapshots
___________________________________________________________________
CLI tool, written in Rust, to diff directory snapshots
Author : jotaen
Score : 32 points
Date : 2024-01-22 19:14 UTC (3 hours ago)
(HTM) web link (www.jotaen.net)
| hartator wrote:
| Isn't this just Git?
| quacker wrote:
| I was surprised to not see a mention of `git diff` anywhere in
| the post.
|
| Out of curiosity, I set up two directories where the second has
| a new file, a modified file, and a removed file compared to the
| first:
|
|     $ tree dir1/ dir2/
|     dir1/
|     +-- empty.txt
|     +-- modified.txt
|     +-- remove-this.txt
|     +-- unchanged.txt
|     dir2/
|     +-- empty.txt
|     +-- modified.txt
|     +-- new-file.txt
|     +-- unchanged.txt
|
| Then `git diff --name-status dir1 dir2` outputs the following,
| showing the changes by file name.
|
|     $ git diff --name-status dir1 dir2
|     M       dir1/modified.txt
|     A       dir2/new-file.txt
|     D       dir1/remove-this.txt
|
| This doesn't require running `git init` on the two directories
| either, so it's immediately usable out of the box.
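
For comparison, the same added/modified/removed classification can be sketched in a few lines of Python. This is only an illustration, not how git or snapdiff work internally: the function names `relative_files` and `name_status` are made up here, files are compared byte by byte, and there is no special handling for symlinks or empty directories.

```python
import filecmp
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def name_status(old_root, new_root):
    """Return (status, relative_path) pairs, sorted by path.

    Status is "A" (only in new_root), "D" (only in old_root) or
    "M" (present in both, but with different contents).
    """
    old, new = relative_files(old_root), relative_files(new_root)
    result = [("A", path) for path in new - old]
    result += [("D", path) for path in old - new]
    for path in old & new:
        # shallow=False forces a comparison of the actual file contents
        # instead of trusting size and mtime alone.
        if not filecmp.cmp(os.path.join(old_root, path),
                           os.path.join(new_root, path), shallow=False):
            result.append(("M", path))
    return sorted(result, key=lambda pair: pair[1])
```

Calling `name_status("dir1", "dir2")` on the two example directories above would report the modified, new, and removed file, each path relative to its directory root.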
| mylittlebrain wrote:
| That was my first thought. A content-addressable file system*
| like Git or Mercurial would be the simplest thing that could
| work. * Not sure these VCSs could be called a CAS.
| jotaen wrote:
| I suppose you could make a similar thing happen based on git,
| or maybe even a combination of `find` and `diff`.
|
| It's certainly an interesting debate whether to use a general-
| purpose approach vs creating a dedicated and fully customisable
| implementation. In this case, I was interested in the exact
| numbers and output structure that snapdiff produces. I'm not
| sure what it would take to make the same output happen by using
| git, or how well that would work on directory sizes in (or
| beyond) the 100,000 files / 100 GB region.
|
| If someone would be up for trying that out and sharing their
| insights, I'd be interested to learn about it.
| codetrotter wrote:
| OP should have a look at ZFS. With large amounts of data I feel
| that ZFS snapshots might be far more time efficient to compare
| than diffing full directories.
|
| Bonus: FreeBSD is currently considering adding Rust to their base
| system. They have ZFS natively in FreeBSD already. Perhaps OP
| will find joy in FreeBSD :D
| timetraveller26 wrote:
| Yes, zfs rules. If you are doing this regularly you should
| consider a fs like zfs or use another solution like Borg (which,
| like git, stores content-addressed data).
|
| Though this has the convenience of being a more universal
| solution.
| jotaen wrote:
| I haven't looked into ZFS yet, but thanks to your comment, it's
| now on my todo list.
|
| One idea behind my implementation was to have something that's
| more agnostic of specific file systems. But I guess that's an
| aspect that may be worth reconsidering.
| gumby wrote:
| Why does it matter if it is written in Rust or assembly code?
| omaranto wrote:
| I think if you do write something in Rust it is customary to
| mention it to avoid getting tons of suggestions to rewrite it
| in Rust.
| jacquesm wrote:
| Because 'in Rust' is good for at least 30 upvotes.
| timetraveller26 wrote:
| They are special prompts for the HN LLM
| throwaway8582 wrote:
| Aside from arguments about performance and memory safety, I'm
| generally more likely to try something written in Rust (or Go)
| because projects in those languages tend to be easy to build
| or download as a static binary. For Rust projects, `cargo
| install <name>` generally works. On the other hand, when I see
| something written in C++ or Python, it's an indicator that
| there may be significantly more work involved.
| jotaen wrote:
| I get the suspicious sentiment, but I mentioned it for other
| reasons in the title. Apart from solving a personal need, this
| project is largely about tinkering with Rust and performance
| optimisations. I was hoping that mentioning the language
| prominently would help attract people that may give valuable
| feedback regarding those things.
| rpigab wrote:
| This looks nice! I like TUI programs for this purpose.
|
| Previously, I've used Beyond Compare 4 by Scooter Software (GUI,
| free to try). It's nice to have more options, because `diff -r`
| doesn't get you very far.
|
| I also like the ability to find duplicate images or any files
| regardless of location with Czkawka (GitHub qarmin/czkawka).
| grow2grow wrote:
| > (For some extra "entropy", by the way, snapdiff also takes the
| file size into consideration when comparing files.)
|
| I get what you mean by "entropy", but wouldn't it be more direct
| to just say matching hashes will have their file sizes compared
| as a remedy to the collision?
|
| Thanks for the effort, the "back yard diy" utilities are great
| for learning, especially with an accompanying article.
| jotaen wrote:
| Yeah, fair point - I couldn't think of a better word, that's
| why I wrapped it in quotes. I've tried to simplify the
| phrasing:
|
| > For some extra safety margin to avoid collisions, by the way,
| `snapdiff` also takes the file size into consideration when
| comparing files.
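
The idea of pairing a content hash with the file size can be sketched as follows. This is a minimal illustration, not snapdiff's actual code: the thread does not say which hash function snapdiff uses, so SHA-256 is an assumption, and the names `fingerprint` and `same_content` are made up.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, hex_digest) for the file at path."""
    digest = hashlib.sha256()  # assumption: any stable hash would do here
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def same_content(path_a, path_b):
    """Treat two files as identical only if size and hash both match."""
    # Cheap check first: differing sizes rule out equality without hashing.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return fingerprint(path_a) == fingerprint(path_b)
```

Comparing the `(size, digest)` pair rather than the digest alone means two files would need to collide on the hash and also have the same length to be mistaken for each other.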
| mustache_kimono wrote:
| Curious -- how does your system handle hardlinks? I have a ZFS
| system which does a ZFS "roll forward":
|
|     --roll-forward=<ROLL_FORWARD>
|         traditionally 'zfs rollback' is a destructive operation,
|         whereas httm roll-forward is non-destructive. httm will
|         copy only files and their attributes that have changed
|         since a specified snapshot, from that snapshot, to its
|         live dataset...
|
| One of the more difficult problems I had to deal with was hardlink
| resolution. Basically I still have to scan the whole dataset, and
| create a map of hardlinks before any run.
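
A map of hardlinks like the one described can be built by grouping paths on their (device, inode) pair. The sketch below is a generic illustration in Python and makes no claim about how httm does it; the function name `hardlink_groups` is made up, and it assumes a platform where `st_ino` and `st_nlink` are meaningful.

```python
import os
from collections import defaultdict


def hardlink_groups(root):
    """Map (device, inode) to every path under root that shares it."""
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: do not follow symlinks
            # st_nlink > 1 means this inode has at least one other name,
            # possibly outside the scanned tree.
            if info.st_nlink > 1:
                by_inode[(info.st_dev, info.st_ino)].append(path)
    # Keep only groups with at least two names inside the scanned tree.
    return {key: sorted(paths)
            for key, paths in by_inode.items() if len(paths) > 1}
```

As the comment says, this still requires walking the whole tree once, since the file system offers no cheap way to list the other names of an inode.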
| jotaen wrote:
| Oh thanks, that's a good point. I actually haven't considered
| hardlinks at all yet.
|
| I should probably look into that at some point -
| https://github.com/jotaen/snapdiff/issues/2
| sys_64738 wrote:
| What does this do that the plain old `diff dir1 dir2` command
| doesn't?
| rhettbull wrote:
| I've created a similar tool in Python that I find quite useful:
| https://github.com/RhetTbull/dirsnapshot
|
| It doesn't compute content hashes as your tool does, but it is
| designed to highlight files added, removed, or changed, and I use
| it primarily for reverse engineering projects. One key feature is
| that it can store the snapshot in a SQLite database that doesn't
| take much space, so you can compare the directory against the
| stored snapshot at some point in the future. Diffs are computed
| based on stat() info: mode, ownership, size, mtime. In addition
| to a CLI, it also provides a Python API so you can use it directly
| in your own code.
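
The stat()-based approach can be sketched with nothing but the standard library. This is not dirsnapshot's API or schema: the function names (`scan`, `take_snapshot`, `diff_against_snapshot`) and the table layout are hypothetical, and the snapshot database is assumed to live outside the directory being scanned.

```python
import os
import sqlite3


def scan(root):
    """Yield (relative_path, mode, uid, gid, size, mtime_ns) per file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            yield (os.path.relpath(full, root), st.st_mode, st.st_uid,
                   st.st_gid, st.st_size, st.st_mtime_ns)


def take_snapshot(root, db_path):
    """Store the stat() info of every file under root in a SQLite file."""
    db = sqlite3.connect(db_path)
    try:
        db.execute("DROP TABLE IF EXISTS files")
        db.execute(
            "CREATE TABLE files (path TEXT PRIMARY KEY, mode INTEGER, "
            "uid INTEGER, gid INTEGER, size INTEGER, mtime_ns INTEGER)"
        )
        db.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
                       scan(root))
        db.commit()
    finally:
        db.close()


def diff_against_snapshot(root, db_path):
    """Compare the current state of root with a stored snapshot."""
    db = sqlite3.connect(db_path)
    try:
        old = {row[0]: row[1:] for row in db.execute("SELECT * FROM files")}
    finally:
        db.close()
    new = {row[0]: row[1:] for row in scan(root)}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # A file counts as changed if any recorded stat field differs.
        "changed": sorted(p for p in old.keys() & new.keys()
                          if old[p] != new[p]),
    }
```

Because no file contents are read, this is fast, but a change that preserves mode, ownership, size, and mtime would go unnoticed.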
___________________________________________________________________
(page generated 2024-01-22 23:01 UTC)