[HN Gopher] Go Find Duplicates: A fast and simple tool to find d...
       ___________________________________________________________________
        
       Go Find Duplicates: A fast and simple tool to find duplicate files
        
       Author : ingve
       Score  : 41 points
       Date   : 2021-08-29 11:14 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ColinWright wrote:
       | Over the years I've used many, many tools intended to solve this
       | problem. In the end, after much frustration, I just use existing
        | tools, glued together in a un*x manner.
        | 
        |       find * -type f -exec md5sum '{}' ';' \
        |         | tee /tmp/index_file.txt \
        |         | gawk '{print $1}' \
        |         | sort | uniq -c \
        |         | gawk '$1 > 1 { print $2 }' \
        |         > /tmp/duplicates.txt
        | 
        |       for m in $( cat /tmp/duplicates.txt )
        |       do
        |         grep "$m" /tmp/index_file.txt
        |         echo ========
        |       done | less
       | 
       | Tweak as necessary. I do have a comparison executable that only
       | compares sizes and sub-portions to save time, but I generally
       | find it's not worth it.
       | 
        | It takes less time to type this than it does to remember
       | what some random other tool is called, or how to use it. I also
       | have saved a variant that identifies similar files, and another
       | that identifies directory structures with lots of shared files,
       | but those are (understandably) more complex (and fragile).
        
       | fintler wrote:
       | If you want something that scales horizontally (mostly), dcmp
       | from https://github.com/hpc/mpifileutils is an option. It can
       | chunk up files and do the comparison in parallel on multiple
       | servers.
        
       | HumblyTossed wrote:
       | Does it only find duplicate files or will it also find duplicate
       | directory hierarchies?
       | 
       | Example:
       | 
       | /some/location/one/January/Photos
       | 
       | /some/location/two/January/Photos
       | 
        | I need a tool that would return a match on the January directory.
       | 
       | It would be great to be able to filter things. So for example, if
       | I have backups of my dev folder, I want to filter out all the
       | virtual envs (venv below): /home/HumblyTossed/dev/venv/bin
       | /home/HumblyTossed/backups/dev/venv/bin
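        | 
        | One way a tool could match whole directory hierarchies (a
        | rough, untested sketch, not the tool under discussion) is a
        | Merkle-style hash: hash each file's contents, hash each
        | directory over its sorted child names and child hashes, and
        | report directories whose hashes collide. Filtering then becomes
        | a matter of skipping entries by name (the "venv" check below is
        | only an illustration):
        | 
        |       package main
        | 
        |       import (
        |           "crypto/sha256"
        |           "fmt"
        |           "io"
        |           "os"
        |           "path/filepath"
        |           "sort"
        |       )
        | 
        |       // hashDir hashes a directory over its sorted child names,
        |       // child file contents and child directory hashes, so two
        |       // identical trees (e.g. both January/Photos copies) end up
        |       // with identical hashes. Every directory hash is recorded
        |       // in index so duplicates can be reported afterwards.
        |       func hashDir(dir string, index map[string][]string) ([]byte, error) {
        |           entries, err := os.ReadDir(dir)
        |           if err != nil {
        |               return nil, err
        |           }
        |           sort.Slice(entries, func(i, j int) bool {
        |               return entries[i].Name() < entries[j].Name()
        |           })
        |           h := sha256.New()
        |           for _, e := range entries {
        |               if e.Name() == "venv" { // crude filter example
        |                   continue
        |               }
        |               p := filepath.Join(dir, e.Name())
        |               io.WriteString(h, e.Name())
        |               if e.IsDir() {
        |                   sub, err := hashDir(p, index)
        |                   if err != nil {
        |                       return nil, err
        |                   }
        |                   h.Write(sub)
        |               } else {
        |                   f, err := os.Open(p)
        |                   if err != nil {
        |                       return nil, err
        |                   }
        |                   io.Copy(h, f)
        |                   f.Close()
        |               }
        |           }
        |           sum := h.Sum(nil)
        |           key := fmt.Sprintf("%x", sum)
        |           index[key] = append(index[key], dir)
        |           return sum, nil
        |       }
        | 
        |       func main() {
        |           index := map[string][]string{}
        |           if _, err := hashDir(os.Args[1], index); err != nil {
        |               fmt.Fprintln(os.Stderr, err)
        |               os.Exit(1)
        |           }
        |           for _, dirs := range index {
        |               if len(dirs) > 1 {
        |                   fmt.Println(dirs)
        |               }
        |           }
        |       }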
        
         | chalcolithic wrote:
         | If you downvote HumblyTossed's comment please explain why.
        
       | scns wrote:
       | <irony> RESF checking in </irony>
       | 
        | The first one I found, and still use, once it became obvious
        | that fslint is EOL is czkawka [0] ("hiccup" in Polish). It is
        | an order of magnitude faster than fslint, and its memory use
        | is 20%-75% of fslint's.
       | 
       | <;)> Satisfied customer, would buy it again. </;)>
       | 
       | [0] https://github.com/qarmin/czkawka
        
       | idoubtit wrote:
       | As far as I know, the standard tool for this is rdfind. This new
        | tool claims to be "blazingly fast", so it should provide
        | something to back that claim up. Ideally a comparison with
        | rdfind, but even a basic benchmark would make it less dubious.
       | https://github.com/pauldreik/rdfind
       | 
       | But the main problem is not the suspicious performance, it's the
       | lack of explanation. The tool is supposed to "find duplicate
        | files (photos, videos, music, documents)". Does that mean it is
        | restricted to certain file types? Does it consider identical
        | photos with different metadata to be duplicates? Compare this
        | with rdfind
       | which clearly describes what it does, provides a summary of its
       | algorithm, and even mentions alternatives.
       | 
        | Overall, it may be a fine toy/hobby project (only 3 commits, 3
        | months ago); I didn't read the code (except to find the
        | command-line options). I don't get why it got so much attention.
        
         | justinsaccount wrote:
         | Yeah, this tool does not appear to be very good, especially
         | compared to established alternatives.
         | 
         | It initially groups files that have the "same extension and
         | same size", so you're out of luck if you have two copies named
         | foo.jpg and foo.jpeg.
         | 
         | Then, it cheats by computing a crc32 (!) of the beginning,
         | middle, and end bytes of the file and groups together files
         | that have the same crc32.
         | 
         | So, it'll mostly work, but miss a lot of duplicates, and
         | potentially flag different files as duplicates.
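          | 
          | From that description, the fingerprint is roughly something
          | like the sketch below (a reconstruction for illustration, not
          | the project's actual code): a CRC32 over the first, middle
          | and last few KiB. Equal fingerprints don't prove equality, so
          | a careful tool would still follow up with a full hash or a
          | byte-by-byte compare.
          | 
          |       package main
          | 
          |       import (
          |           "fmt"
          |           "hash/crc32"
          |           "io"
          |           "os"
          |       )
          | 
          |       // partialCRC32 hashes only the first, middle and last
          |       // 4 KiB of a file, trading accuracy for speed.
          |       func partialCRC32(path string) (uint32, error) {
          |           f, err := os.Open(path)
          |           if err != nil {
          |               return 0, err
          |           }
          |           defer f.Close()
          |           info, err := f.Stat()
          |           if err != nil {
          |               return 0, err
          |           }
          |           const chunk = 4096
          |           h := crc32.NewIEEE()
          |           for _, off := range []int64{0, info.Size() / 2, info.Size() - chunk} {
          |               if off < 0 {
          |                   off = 0
          |               }
          |               buf := make([]byte, chunk)
          |               n, err := f.ReadAt(buf, off)
          |               if err != nil && err != io.EOF {
          |                   return 0, err
          |               }
          |               h.Write(buf[:n])
          |           }
          |           return h.Sum32(), nil
          |       }
          | 
          |       func main() {
          |           sum, err := partialCRC32(os.Args[1])
          |           if err != nil {
          |               fmt.Fprintln(os.Stderr, err)
          |               os.Exit(1)
          |           }
          |           fmt.Printf("%08x\n", sum)
          |       }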
        
         | artemisart wrote:
         | See also fclones (focuses on performance, has benchmarks
         | https://github.com/pkolaczk/fclones). I didn't know about
         | rdfind but thought the standard was fdupes
         | https://github.com/adrianlopezroche/fdupes, which is as fast
         | (or slow) as rdfind according to fclones (and fclones is much
         | faster).
        
           | justinsaccount wrote:
           | afaik fdupes is super slow because it checksums entire files
           | in order to find duplicates. This causes a ton of unnecessary
           | IO if you have a lot of size collisions.
           | 
           | The efficient way to do things is to just read files in
           | parallel and break once they diverge. Basically how `cmp`
           | works.
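            | 
            | Loosely, that early-exit comparison looks like this (an
            | untested sketch of the idea, not any particular tool's
            | code): stream both files in chunks and return as soon as
            | they diverge, so mismatched files with the same size cost
            | only as much IO as their common prefix.
            | 
            |       package main
            | 
            |       import (
            |           "bytes"
            |           "fmt"
            |           "io"
            |           "os"
            |       )
            | 
            |       // sameContent compares two files chunk by chunk and
            |       // stops reading at the first difference.
            |       func sameContent(a, b string) (bool, error) {
            |           fa, err := os.Open(a)
            |           if err != nil {
            |               return false, err
            |           }
            |           defer fa.Close()
            |           fb, err := os.Open(b)
            |           if err != nil {
            |               return false, err
            |           }
            |           defer fb.Close()
            |           bufA := make([]byte, 64*1024)
            |           bufB := make([]byte, 64*1024)
            |           for {
            |               na, errA := io.ReadFull(fa, bufA)
            |               nb, errB := io.ReadFull(fb, bufB)
            |               if na != nb || !bytes.Equal(bufA[:na], bufB[:nb]) {
            |                   return false, nil // diverged: stop here
            |               }
            |               aDone := errA == io.EOF || errA == io.ErrUnexpectedEOF
            |               bDone := errB == io.EOF || errB == io.ErrUnexpectedEOF
            |               if aDone || bDone {
            |                   return aDone && bDone, nil
            |               }
            |               if errA != nil {
            |                   return false, errA
            |               }
            |               if errB != nil {
            |                   return false, errB
            |               }
            |           }
            |       }
            | 
            |       func main() {
            |           same, err := sameContent(os.Args[1], os.Args[2])
            |           if err != nil {
            |               fmt.Fprintln(os.Stderr, err)
            |               os.Exit(1)
            |           }
            |           fmt.Println(same)
            |       }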
        
             | pvaldes wrote:
             | Rdfind is the logical evolution of fdupes. Not only faster
             | but also more clever. This does not mean that fdupes is a
             | bad tool at all, but rdfind can do things that fdupes
             | can't.
             | 
              | Example: for file-X.txt with X = 1 to 10000, you have
              | three copies of each file in:
             | 
             | mydir/file-X.txt
             | 
             | mydir/subdir/file-X.txt and
             | 
             | mydir/subdir/copy/file-X.txt.
             | 
              | Fdupes would delete random files in mydir/, mydir/subdir/
              | and mydir/subdir/copy/. You would end up with the
              | remaining files scattered across the directory tree: a
              | mess of three incomplete copies.
             | 
              | Rdfind correctly guesses that what most people would want
              | is to remove all the files in two of the directories and
              | keep one copy (files and directory tree) intact. So it
              | wipes the inner subdirs in a predictable way and keeps the
              | outer dir intact. This is a terrific feature, able to
              | disentangle a directory tree that has been cloned and
              | nested inside the original copy without destroying it, as
              | in this case:
             | 
             | a/b/c/d/files00X.txt
             | 
             | a/b/a/b/c/d/files00X.txt
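              | 
              | A toy illustration of that kind of preference (not
              | rdfind's actual ranking rules): given a group of
              | identical files, keep the copy with the shallowest path
              | and mark the rest as removable.
              | 
              |       package main
              | 
              |       import (
              |           "fmt"
              |           "sort"
              |           "strings"
              |       )
              | 
              |       // keepShallowest sorts a duplicate group by path
              |       // depth (then lexicographically) and keeps the
              |       // first entry, so nested clones get removed and
              |       // the outer tree survives.
              |       func keepShallowest(group []string) (string, []string) {
              |           sort.Slice(group, func(i, j int) bool {
              |               di := strings.Count(group[i], "/")
              |               dj := strings.Count(group[j], "/")
              |               if di != dj {
              |                   return di < dj
              |               }
              |               return group[i] < group[j]
              |           })
              |           return group[0], group[1:]
              |       }
              | 
              |       func main() {
              |           keep, remove := keepShallowest([]string{
              |               "a/b/a/b/c/d/files001.txt",
              |               "a/b/c/d/files001.txt",
              |           })
              |           fmt.Println("keep:", keep, "remove:", remove)
              |       }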
        
         | diskzero wrote:
         | I think we need a lookup table of marketing speech to real-
          | world performance metrics. "Blazingly fast" has been showing
          | up a lot lately.
         | 
         | The cynical side of me wants to know what features and safety
         | checks a "blazingly fast" tool has not implemented that the
         | older "glacially slow" tool it is replacing ended up
         | implementing after all the edge conditions were uncovered.
        
         | code_biologist wrote:
         | I've found rmlint to be another very good tool in this space:
         | https://github.com/sahib/rmlint
        
       | andmarios wrote:
        | A shameless plug, but it is a simple --and probably badly written--
       | tool I made many years ago to scratch an itch, and I still use
       | it.
       | 
       | It finds duplicate files and replaces them with hard links,
        | saving you space. Just make sure you provide it with paths on
        | the same filesystem.
       | 
        | I originally wrote it to save some space on personal files
        | (videos, photos, etc.), but it turned out to be very useful for tar
       | files, docker images, websites, and more. For example I maintain
       | a tar file and a docker image with Kafka connectors which share
       | many jar files. Using duphard I can save hundreds of megabytes,
       | or even more than a gigabyte! For a documentation website with
       | many copies of the same image (let's just say some static
       | generators favor this practice for maintaining multiple
        | versions), I can reduce the website size by 60%+, which then
        | makes ssh copies, docker pulls, etc. much faster, speeding up
        | deployment times.
       | 
       | https://github.com/andmarios/duphard
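        | 
        | The hard-link replacement itself is only a couple of syscalls
        | once two paths are known to be identical. A minimal sketch of
        | the idea (not duphard's actual code; the temp-file name is
        | made up for illustration):
        | 
        |       package main
        | 
        |       import (
        |           "fmt"
        |           "os"
        |       )
        | 
        |       // replaceWithHardLink makes dup a hard link to orig, so
        |       // both names share one inode and the duplicate's blocks
        |       // are freed. The caller must have verified the contents
        |       // are identical and that both paths are on the same
        |       // filesystem (os.Link fails with EXDEV otherwise).
        |       func replaceWithHardLink(orig, dup string) error {
        |           tmp := dup + ".dedup-tmp"
        |           if err := os.Link(orig, tmp); err != nil {
        |               return err
        |           }
        |           // Rename over the duplicate so the name never
        |           // disappears; rename is atomic within a filesystem.
        |           if err := os.Rename(tmp, dup); err != nil {
        |               os.Remove(tmp)
        |               return err
        |           }
        |           return nil
        |       }
        | 
        |       func main() {
        |           if err := replaceWithHardLink(os.Args[1], os.Args[2]); err != nil {
        |               fmt.Fprintln(os.Stderr, err)
        |               os.Exit(1)
        |           }
        |       }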
        
         | [deleted]
        
       | matzf wrote:
       | Ah yes, that's exactly what CRC32 is supposed to be used for. And
       | it's even quicker if you don't compute it over the whole file,
       | brilliant!
        
       | Bostonian wrote:
        | On Windows, if you download the same file more than once you
        | end up with foo.doc, "foo (1).doc", "foo (2).doc", etc. A script
        | that just looked for files with such names, compared them to
        | foo.doc, and deleted them if they were the same would be useful.
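        | 
        | That much is scriptable: match the " (N)" suffix pattern,
        | compare bytes against the original, and delete on a match. A
        | rough, untested sketch (point it at a Downloads folder):
        | 
        |       package main
        | 
        |       import (
        |           "bytes"
        |           "fmt"
        |           "os"
        |           "path/filepath"
        |           "regexp"
        |       )
        | 
        |       // Matches names like "foo (1).doc" and captures "foo"
        |       // and ".doc" so the original path can be rebuilt.
        |       var dupName = regexp.MustCompile(`^(.*) \(\d+\)(\.[^.]+)$`)
        | 
        |       func main() {
        |           dir := os.Args[1]
        |           entries, err := os.ReadDir(dir)
        |           if err != nil {
        |               fmt.Fprintln(os.Stderr, err)
        |               os.Exit(1)
        |           }
        |           for _, e := range entries {
        |               m := dupName.FindStringSubmatch(e.Name())
        |               if m == nil {
        |                   continue
        |               }
        |               copyPath := filepath.Join(dir, e.Name())
        |               origPath := filepath.Join(dir, m[1]+m[2]) // e.g. foo.doc
        |               a, errA := os.ReadFile(copyPath)
        |               b, errB := os.ReadFile(origPath)
        |               if errA != nil || errB != nil || !bytes.Equal(a, b) {
        |                   continue
        |               }
        |               fmt.Println("removing duplicate:", copyPath)
        |               os.Remove(copyPath)
        |           }
        |       }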
        
         | sumtechguy wrote:
         | http://malich.ru/duplicate_searcher
         | 
         | I have had pretty good luck with that one. I used to use
         | 'duplicate commander' but I am not sure that one is out there
         | anymore.
        
       | yandrypozo wrote:
        | I'm curious: why did the author read only three sections of
        | each file? Is it related to how CRC32 works?
        
       ___________________________________________________________________
       (page generated 2021-08-30 23:01 UTC)