[HN Gopher] Duperemove - Tools for deduping file systems
___________________________________________________________________
Duperemove - Tools for deduping file systems
Author : anotherhue
Score : 27 points
Date : 2024-08-08 15:38 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| doctorpangloss wrote:
| Microsoft's File Deduplication works really well. But I think
| they only enable it in Windows Server because it would truly cut
| PC profits in half if people didn't run out of storage all the
| time.
| sheepdestroyer wrote:
| But Microsoft does not sell local storage?
| mrguyorama wrote:
| Normal People do not run out of storage on their PCs. What data
| do you think normal people even have? Everything is streamed.
| My girlfriend has kept everything that has ever mattered to her
| for 15 years now, including two college degrees' worth of class
| material, in the free tier of OneDrive, like a few gigs maybe.
|
| Her largest digital asset is a gig of mp3s that are a copy of
| the Harry Potter audiobooks. Even that is "streamed" from my
| Google Drive. Even getting her into the hobby of "collecting
| Steam games" hasn't put pressure on her 500 GB hard drive,
| because she doesn't play modern AAA games.
|
| People run out of storage on their phones from taking videos and
| pictures, but they don't ever copy those over to their PCs; if
| they even have PCs, they just buy more cloud storage space from
| Apple.
| floam wrote:
| And common ways of filling up your storage, like big photo or
| music libraries, are unlikely to have many dupes.
| doctorpangloss wrote:
| > Normal People do not run out of storage on their PCs... My
| girlfriend...
|
| Normal people run out of storage on their 128GB SSD macOS
| devices all the time.
| zeotroph wrote:
| That only seems to work on btrfs, XFS, and (maybe now or very
| soon) ZFS and bcachefs: "[duperemove] simply finds candidates for
| dedupe and submits them to the Linux kernel FIDEDUPERANGE ioctl."
| [1] (aka BTRFS_IOC_FILE_EXTENT_SAME), and this ioctl "performs
| the 'compare and share if identical'" (and locking etc.) work
| [2]. On those filesystems, though, that is a nice feature, and
| because the kernel verifies the bytes before sharing anything, it
| lets the tool get away with a weak hash like murmur3 for finding
| candidates.
|
| 1: http://markfasheh.github.io/duperemove/duperemove.html 2:
| https://manpages.debian.org/bookworm/manpages-dev/ioctl_fide...
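| As a rough illustration of that call path (a sketch, not the
| duperemove code), here is a minimal Python snippet that submits
| one src/dst byte range to FIDEDUPERANGE. The request number and
| struct layout follow <linux/fs.h> on x86-64 Linux, and both
| files must sit on a filesystem that implements the ioctl:

    import fcntl
    import os
    import struct
    import sys

    # _IOWR(0x94, 54, struct file_dedupe_range): 24-byte header
    FIDEDUPERANGE = 0xC0189436
    FILE_DEDUPE_RANGE_SAME = 0

    # Header (src_offset, src_length, dest_count, reserved1, reserved2)
    # followed by one file_dedupe_range_info (dest_fd, dest_offset,
    # bytes_deduped, status, reserved).
    REQ_FMT = "=QQHHIqQQiI"

    def dedupe_range(src_path, dst_path, length, src_off=0, dst_off=0):
        src = os.open(src_path, os.O_RDONLY)
        # Dedupe needs write access (or ownership) on the destination.
        dst = os.open(dst_path, os.O_RDWR)
        try:
            packed = struct.pack(REQ_FMT, src_off, length, 1, 0, 0,
                                 dst, dst_off, 0, 0, 0)
            req = bytearray(packed)
            fcntl.ioctl(src, FIDEDUPERANGE, req)  # kernel fills results in
            fields = struct.unpack(REQ_FMT, req)
            bytes_deduped, status = fields[7], fields[8]
            return status == FILE_DEDUPE_RANGE_SAME, bytes_deduped
        finally:
            os.close(src)
            os.close(dst)

    if __name__ == "__main__":
        same, n = dedupe_range(sys.argv[1], sys.argv[2], int(sys.argv[3]))
        print(f"deduped {n} bytes" if same else "ranges differ")
|
| Running it as, say, "python dedupe_range.py a.img b.img 1048576"
| (a made-up invocation) asks the kernel to share the first MiB of
| a.img into b.img, and the kernel refuses unless the bytes match.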
| Alifatisk wrote:
| I have recently stumbled upon a challenge with deduping lots of
| media, both images and videos. They are named in different ways,
| come in different image and video formats, and are spread
| throughout one deeply nested directory tree. In total, it's about
| 170 GB of content.
|
| My approach has been this:
|
| - Flatten out all the deeply nested folders
| - Categorize the content into folders named by year
| - Run deduplication with the threshold set to a 100% match (I
|   really don't want to lose any content)
| - Sync to Google Photos
|
| Has anyone else faced similar challenges?
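| A rough Python sketch of the first two steps above (flatten the
| nested tree into year-named folders), with the caveat that it
| guesses the year from each file's mtime - EXIF/container dates
| would be more reliable for photos and videos - and that the
| paths are made up:

    import shutil
    from datetime import datetime, timezone
    from pathlib import Path

    SRC = Path("media_in")       # deeply nested source tree (made up)
    DST = Path("media_by_year")  # flat, year-named folders (made up)

    def flatten_by_year(src: Path, dst: Path) -> None:
        for f in src.rglob("*"):
            if not f.is_file():
                continue
            # Assumption: take the year from mtime.
            ts = f.stat().st_mtime
            year = datetime.fromtimestamp(ts, tz=timezone.utc).year
            out_dir = dst / str(year)
            out_dir.mkdir(parents=True, exist_ok=True)
            target = out_dir / f.name
            n = 1
            while target.exists():  # don't clobber name clashes
                target = out_dir / f"{f.stem}_{n}{f.suffix}"
                n += 1
            # copy2 keeps timestamps; switch to shutil.move once trusted
            shutil.copy2(f, target)

    if __name__ == "__main__":
        flatten_by_year(SRC, DST)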
| 0cf8612b2e1e wrote:
| If you are only willing to drop exact duplicates, why not just
| take a file hash of everything as your starting point? Delete
| all but one of the collisions.
| Alifatisk wrote:
| DupeGuru has been a pretty effective solution for this. I don't
| know if creating a hash for every file and then checking for
| collisions is a better way? Sounds like it requires more labor.
| 0cf8612b2e1e wrote:
find . -type f -exec sha256sum {} + | sort
|
| Gets you halfway there.
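| The other half - grouping by digest and keeping one file per
| group - is a short script. A Python sketch that only prints the
| would-be deletions (assuming you want to review them before
| removing anything):

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def sha256_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def exact_dupes(root):
        by_digest = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_digest[sha256_of(p)].append(p)
        return {d: ps for d, ps in by_digest.items() if len(ps) > 1}

    if __name__ == "__main__":
        for digest, paths in exact_dupes(sys.argv[1]).items():
            keep, *extras = sorted(paths)
            print("keep     ", keep)
            for p in extras:
                print("duplicate", p)  # os.remove(p) once verified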
| corndoge wrote:
| I think lots of people have faced this challenge. There are
| several tools that will find and allow you to purge exact
| duplicates, such as dupeguru.
| fluential wrote:
| I've been processing a lot of niche audio files, often the same
| tracks but trimmed to different lengths - jdupes [1] worked
| really well.
|
| https://codeberg.org/jbruchon/jdupes
| abspoel wrote:
| A while back I wrote a simple tool to remove or symlink duplicate
| files (exact matches). It's faster than fully hashing each file;
| maybe it's useful to some folks: https://github.com/abspoel/dedup
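| One common way to beat hashing every byte - I don't know whether
| this is exactly what dedup does - is to group files by size
| first and only byte-compare the ones whose sizes collide. A
| small Python sketch of that idea:

    import os
    import sys
    from collections import defaultdict
    from filecmp import cmp

    def size_groups(root):
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
        # Only files with a size collision can be exact duplicates.
        return [g for g in by_size.values() if len(g) > 1]

    if __name__ == "__main__":
        for group in size_groups(sys.argv[1]):
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    if cmp(a, b, shallow=False):  # full byte compare
                        print(a, "==", b)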
___________________________________________________________________
(page generated 2024-08-08 23:01 UTC)