[HN Gopher] Duperemove - Tools for deduping file systems
       ___________________________________________________________________
        
       Duperemove - Tools for deduping file systems
        
       Author : anotherhue
       Score  : 27 points
       Date   : 2024-08-08 15:38 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | doctorpangloss wrote:
       | Microsoft's File Deduplication works really well. But I think
       | they only enable it in Windows Server because it would truly cut
       | PC profits in half if people didn't run out of storage all the
       | time.
        
         | sheepdestroyer wrote:
         | But Microsoft does not sell local storage?
        
         | mrguyorama wrote:
         | Normal People do not run out of storage on their PCs. What data
         | do you think normal people even have? Everything is streamed.
         | My girlfriend has kept everything that has ever mattered to her
         | for 15 years now, including two college degrees worth of class
         | material, in the free tier of OneDrive, like a few gigs maybe.
         | 
         | Her largest digital asset is a gig of mp3s that are a copy of
         | the Harry Potter audiobooks. Even that is "streamed" from my
         | google drive. Even getting her into the hobby of "collecting
         | steam games" hasn't put pressure on her 500gb hard drive,
         | because she doesn't play modern AAA games.
         | 
          | People run out of storage on their phones from taking videos
          | and pictures, but they don't ever copy those over to their
          | PCs (if they even have PCs); they just buy more cloud storage
          | from Apple.
        
           | floam wrote:
           | And common ways of filling up your storage, like big photo or
           | music libraries, are unlikely to have many dupes.
        
           | doctorpangloss wrote:
           | > Normal People do not run out of storage on their PCs... My
           | girlfriend...
           | 
           | Normal people run out of storage on their 128GB SSD macOS
           | devices all the time.
        
       | zeotroph wrote:
        | That only seems to work on btrfs, XFS, and (maybe now or very
        | soon) ZFS and bcachefs: "[duperemove] simply finds candidates for
       | dedupe and submits them to the Linux kernel FIDEDUPERANGE ioctl."
       | [1] (aka BTRFS_IOC_FILE_EXTENT_SAME), and this ioctl "performs
       | the 'compare and share if identical'" (and locking etc.) work
       | [2]. But on those filesystems, that is a nice feature, plus it
       | lets the tool get away with a weak hash like murmur3.
       | 
        | [1] http://markfasheh.github.io/duperemove/duperemove.html
        | [2] https://manpages.debian.org/bookworm/manpages-dev/ioctl_fide...
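
        A minimal C sketch of what submitting one such candidate pair to
        the FIDEDUPERANGE ioctl looks like, assuming two already-open files
        and caller-chosen offsets and length (the program and argument
        names are illustrative; this is not duperemove's own code):

        /* dedupe_range.c - ask the kernel to share one extent between two
         * files.  Sketch only: error handling is minimal, and filesystems
         * may impose their own alignment and length limits on the range.
         * Build: cc dedupe_range.c -o dedupe_range
         */
        #include <fcntl.h>
        #include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>

        int main(int argc, char **argv)
        {
            if (argc != 5) {
                fprintf(stderr, "usage: %s SRC DST OFFSET LENGTH\n", argv[0]);
                return 1;
            }
            int src = open(argv[1], O_RDONLY);
            int dst = open(argv[2], O_RDWR);
            if (src < 0 || dst < 0) { perror("open"); return 1; }

            /* One destination range; the struct ends in a flexible array. */
            struct file_dedupe_range *arg =
                calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
            arg->src_offset = strtoull(argv[3], NULL, 0);
            arg->src_length = strtoull(argv[4], NULL, 0);
            arg->dest_count = 1;
            arg->info[0].dest_fd = dst;
            arg->info[0].dest_offset = arg->src_offset;

            /* The kernel compares both ranges and shares them only if the
             * bytes are identical ("compare and share if identical"). */
            if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
                perror("FIDEDUPERANGE");
                return 1;
            }
            if (arg->info[0].status == FILE_DEDUPE_RANGE_SAME)
                printf("deduped %llu bytes\n",
                       (unsigned long long)arg->info[0].bytes_deduped);
            else
                printf("not deduped (status %d)\n", (int)arg->info[0].status);
            free(arg);
            return 0;
        }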
        
       | Alifatisk wrote:
        | I have recently stumbled upon a challenge with deduping lots of
        | media, both images and videos. They are named in different ways,
        | come in different image and video formats, and are spread
        | throughout a deeply nested directory tree. In total, it's about
        | 170 GB of content.
       | 
        | My approach has been this:
        | - Flatten out all the deeply nested folders
        | - Categorize the content into folders named by the year
        | - Run deduplication with the threshold set to a 100% match
        |   (I really don't want to lose any content)
        | - Sync to Google Photos
        | 
        | Has anyone else faced similar challenges?
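
        A rough sketch of the "categorize into folders named by the year"
        step, assuming the year is taken from each file's modification time
        and that source and destination sit on the same filesystem (the
        program name and layout are made up for illustration):

        /* bucket_by_year.c - move every regular file under SRC_DIR into
         * DEST_DIR/<year>/, taking the year from the file's mtime.
         * Sketch only: name collisions inside a year folder are not
         * handled, and rename() only works within a single filesystem.
         * Build: cc bucket_by_year.c -o bucket_by_year
         */
        #define _XOPEN_SOURCE 500
        #include <ftw.h>
        #include <libgen.h>
        #include <limits.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <time.h>

        static const char *dest_root;

        static int visit(const char *path, const struct stat *sb,
                         int type, struct FTW *ftwbuf)
        {
            (void)ftwbuf;
            if (type != FTW_F)
                return 0;                       /* regular files only */

            struct tm tm;
            localtime_r(&sb->st_mtime, &tm);

            char dir[PATH_MAX], dest[PATH_MAX], tmp[PATH_MAX];
            snprintf(dir, sizeof dir, "%s/%04d", dest_root, tm.tm_year + 1900);
            mkdir(dir, 0755);                   /* ignore EEXIST */

            snprintf(tmp, sizeof tmp, "%s", path);
            snprintf(dest, sizeof dest, "%s/%s", dir, basename(tmp));
            if (rename(path, dest) != 0)        /* flattens the hierarchy */
                perror(path);
            return 0;
        }

        int main(int argc, char **argv)
        {
            if (argc != 3) {
                fprintf(stderr, "usage: %s SRC_DIR DEST_DIR\n", argv[0]);
                return 1;
            }
            dest_root = argv[2];
            return nftw(argv[1], visit, 64, FTW_PHYS) != 0;
        }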
        
         | 0cf8612b2e1e wrote:
         | If you are only willing to drop exact duplicates, why not just
         | take a file hash of everything as your starting point? Delete
         | all but one of the collisions.
        
           | Alifatisk wrote:
            | DupeGuru has been a pretty effective solution for this. I
            | don't know whether creating a hash for every file and then
            | checking for collisions is any better? It sounds like it
            | requires more labor.
        
             | 0cf8612b2e1e wrote:
             | find . -type f -exec sha256sum {} \; | sort
             | 
              | Gets you halfway there.
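
        A sketch of the other half in C, assuming OpenSSL's SHA-256 for the
        hashing (link with -lcrypto); it only reports duplicate pairs
        rather than deleting anything, and all names in it are made up for
        illustration:

        /* report_dupes.c - hash every regular file under a directory with
         * SHA-256 and print the paths of files whose contents are identical.
         * Sketch only: everything is held in memory and nothing is deleted.
         * Build: cc report_dupes.c -o report_dupes -lcrypto   (OpenSSL)
         */
        #define _XOPEN_SOURCE 500
        #include <ftw.h>
        #include <limits.h>
        #include <openssl/sha.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/stat.h>

        struct entry {
            unsigned char digest[SHA256_DIGEST_LENGTH];   /* first member */
            char path[PATH_MAX];
        };

        static struct entry *entries;
        static size_t count, cap;

        static int hash_file(const char *path, unsigned char *out)
        {
            FILE *f = fopen(path, "rb");
            if (!f)
                return -1;
            SHA256_CTX ctx;
            SHA256_Init(&ctx);
            unsigned char buf[1 << 16];
            size_t n;
            while ((n = fread(buf, 1, sizeof buf, f)) > 0)
                SHA256_Update(&ctx, buf, n);
            fclose(f);
            SHA256_Final(out, &ctx);
            return 0;
        }

        static int visit(const char *path, const struct stat *sb,
                         int type, struct FTW *ftwbuf)
        {
            (void)sb; (void)ftwbuf;
            if (type != FTW_F)
                return 0;                          /* regular files only */
            if (count == cap) {
                cap = cap ? cap * 2 : 1024;
                entries = realloc(entries, cap * sizeof *entries);
            }
            if (hash_file(path, entries[count].digest) == 0) {
                snprintf(entries[count].path, PATH_MAX, "%s", path);
                count++;
            }
            return 0;
        }

        static int by_digest(const void *a, const void *b)
        {
            /* digest is the first struct member, so the pointers work */
            return memcmp(a, b, SHA256_DIGEST_LENGTH);
        }

        int main(int argc, char **argv)
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s DIR\n", argv[0]);
                return 1;
            }
            nftw(argv[1], visit, 64, FTW_PHYS);
            qsort(entries, count, sizeof *entries, by_digest);
            /* After sorting, identical contents sit next to each other. */
            for (size_t i = 1; i < count; i++)
                if (memcmp(entries[i].digest, entries[i - 1].digest,
                           SHA256_DIGEST_LENGTH) == 0)
                    printf("dupe: %s == %s\n",
                           entries[i - 1].path, entries[i].path);
            return 0;
        }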
        
         | corndoge wrote:
         | I think lots of people have faced this challenge. There are
         | several tools that will find and allow you to purge exact
         | duplicates, such as dupeguru.
        
       | fluential wrote:
        | I've been processing a lot of niche audio files, often the same
        | tracks but trimmed in length - jdupes [1] worked really well.
        | 
        | [1] https://codeberg.org/jbruchon/jdupes
        
       | abspoel wrote:
       | A while back I wrote a simple tool to remove or symlink duplicate
        | files (exact matches). It's faster than fully hashing each file;
        | maybe it's useful to some folks: https://github.com/abspoel/dedup
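
        The linked tool's internals aren't spelled out here, but a common
        way to beat full hashing is to compare file sizes first and then
        stream bytes, stopping at the first mismatch. A minimal sketch of
        that pairwise check (an assumed approach, not abspoel's code):

        /* same_contents.c - decide whether two files are byte-identical
         * without hashing: compare sizes first, then stream both files and
         * stop at the first differing block.
         * Build: cc same_contents.c -o same_contents
         */
        #include <stdio.h>
        #include <string.h>
        #include <sys/stat.h>

        static int same_contents(const char *a, const char *b)
        {
            struct stat sa, sb;
            if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
                return -1;
            if (sa.st_size != sb.st_size)
                return 0;                   /* different sizes: no match */

            FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
            if (!fa || !fb) {
                if (fa) fclose(fa);
                if (fb) fclose(fb);
                return -1;
            }
            unsigned char ba[1 << 16], bb[1 << 16];
            int same = 1;
            size_t na;
            while (same && (na = fread(ba, 1, sizeof ba, fa)) > 0) {
                /* bail out at the first mismatching block */
                if (fread(bb, 1, na, fb) != na || memcmp(ba, bb, na) != 0)
                    same = 0;
            }
            fclose(fa);
            fclose(fb);
            return same;
        }

        int main(int argc, char **argv)
        {
            if (argc != 3) {
                fprintf(stderr, "usage: %s FILE1 FILE2\n", argv[0]);
                return 1;
            }
            int r = same_contents(argv[1], argv[2]);
            if (r < 0) { perror("same_contents"); return 2; }
            puts(r ? "identical" : "different");
            return 0;
        }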
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:01 UTC)