[HN Gopher] Go Find Duplicates: A fast and simple tool to find d...
___________________________________________________________________
Go Find Duplicates: A fast and simple tool to find duplicate files
Author : ingve
Score : 41 points
Date : 2021-08-29 11:14 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ColinWright wrote:
| Over the years I've used many, many tools intended to solve this
| problem. In the end, after much frustration, I just use existing
| tools, glued together in a un*x manner.
|
|       find * -type f -exec md5sum '{}' ';' \
|           | tee /tmp/index_file.txt \
|           | gawk '{print $1}' \
|           | sort \
|           | uniq -c \
|           | gawk '!/^ *1 / { print $2 }' \
|           > /tmp/duplicates.txt
|
|       for m in $( cat /tmp/duplicates.txt )
|       do
|           grep $m /tmp/index_file.txt
|           echo ========
|       done \
|           | less
|
| Tweak as necessary. I do have a comparison executable that only
| compares sizes and sub-portions to save time, but I generally
| find it's not worth it.
|
| It takes less time to type this than it does to remember
| what some random other tool is called, or how to use it. I also
| have saved a variant that identifies similar files, and another
| that identifies directory structures with lots of shared files,
| but those are (understandably) more complex (and fragile).
| fintler wrote:
| If you want something that scales horizontally (mostly), dcmp
| from https://github.com/hpc/mpifileutils is an option. It can
| chunk up files and do the comparison in parallel on multiple
| servers.
| HumblyTossed wrote:
| Does it only find duplicate files or will it also find duplicate
| directory hierarchies?
|
| Example:
|
| /some/location/one/January/Photos
|
| /some/location/two/January/Photos
|
| I need a tool that would return a match on the January directory.
|
| It would be great to be able to filter things. So for example, if
| I have backups of my dev folder, I want to filter out all the
| virtual envs (venv below):
|
| /home/HumblyTossed/dev/venv/bin
|
| /home/HumblyTossed/backups/dev/venv/bin
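|
| A rough sketch of one way a tool could do this (none of the
| tools in this thread is confirmed to work this way): fingerprint
| each directory from the sorted (relative path, content hash)
| pairs beneath it and report directories that share a
| fingerprint; a path filter, for example skipping anything under
| venv/, would slot into the same walk. Names like dirFingerprint
| below are illustrative.
|
|       // Sketch: detect duplicate directory trees by hashing the sorted
|       // (relative path, file content hash) pairs under each directory.
|       // Illustrative only; not taken from any tool in this thread.
|       package main
|
|       import (
|           "crypto/sha256"
|           "fmt"
|           "io"
|           "os"
|           "path/filepath"
|           "sort"
|       )
|
|       // fileHash returns the SHA-256 of a file's contents.
|       func fileHash(path string) (string, error) {
|           f, err := os.Open(path)
|           if err != nil {
|               return "", err
|           }
|           defer f.Close()
|           h := sha256.New()
|           if _, err := io.Copy(h, f); err != nil {
|               return "", err
|           }
|           return fmt.Sprintf("%x", h.Sum(nil)), nil
|       }
|
|       // dirFingerprint hashes every regular file under root, keyed by its
|       // path relative to root, so two trees with identical layout and
|       // contents get the same fingerprint wherever they live.
|       func dirFingerprint(root string) (string, error) {
|           var entries []string
|           err := filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
|               if err != nil || !info.Mode().IsRegular() {
|                   return err
|               }
|               rel, _ := filepath.Rel(root, p)
|               sum, err := fileHash(p)
|               if err != nil {
|                   return err
|               }
|               entries = append(entries, rel+"\x00"+sum)
|               return nil
|           })
|           if err != nil {
|               return "", err
|           }
|           sort.Strings(entries)
|           h := sha256.New()
|           for _, e := range entries {
|               io.WriteString(h, e+"\n")
|           }
|           return fmt.Sprintf("%x", h.Sum(nil)), nil
|       }
|
|       func main() {
|           // e.g. the two January/Photos directories from the example above
|           a, errA := dirFingerprint(os.Args[1])
|           b, errB := dirFingerprint(os.Args[2])
|           if errA != nil || errB != nil {
|               fmt.Fprintln(os.Stderr, "walk failed:", errA, errB)
|               os.Exit(1)
|           }
|           fmt.Println("duplicate trees:", a == b)
|       }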
| chalcolithic wrote:
| If you downvote HumblyTossed's comment please explain why.
| scns wrote:
| <irony> RESF checking in </irony>
|
| The first one I found when it became obvious that fslint is EOL,
| and the one I still use, is czkawka [0] (Polish for "hiccup"). It
| is an order of magnitude faster than fslint, and its memory use
| is 20%-75% of fslint's.
|
| <;)> Satisfied customer, would buy it again. </;)>
|
| [0] https://github.com/qarmin/czkawka
| idoubtit wrote:
| As far as I know, the standard tool for this is rdfind. This new
| tool claims to be "blazingly fast", so it should provide
| something to back that up: ideally a comparison with rdfind, but
| even a basic benchmark would make the claim less dubious.
| https://github.com/pauldreik/rdfind
|
| But the main problem is not the suspicious performance, it's the
| lack of explanation. The tool is supposed to "find duplicate
| files (photos, videos, music, documents)". Does that mean it is
| restricted to certain file types? Does it consider identical
| photos with different metadata to be duplicates? Compare this with rdfind,
| which clearly describes what it does, provides a summary of its
| algorithm, and even mentions alternatives.
|
| Overall, it may be a fine toy/hobby project (only 3 commits, 3
| months ago); I didn't read the code (except to find the
| command-line options). I don't get why it got so much attention.
| justinsaccount wrote:
| Yeah, this tool does not appear to be very good, especially
| compared to established alternatives.
|
| It initially groups files that have the "same extension and
| same size", so you're out of luck if you have two copies named
| foo.jpg and foo.jpeg.
|
| Then, it cheats by computing a crc32 (!) of the beginning,
| middle, and end bytes of the file and groups together files
| that have the same crc32.
|
| So, it'll mostly work, but miss a lot of duplicates, and
| potentially flag different files as duplicates.
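|
| If that description is accurate, the core of the approach looks
| roughly like the sketch below (reconstructed from the README
| wording, not taken from the project's code); the sampling
| function is where both weaknesses come from.
|
|       // Rough reconstruction of the grouping strategy described above,
|       // based on the README wording rather than the project's actual code.
|       package main
|
|       import (
|           "fmt"
|           "hash/crc32"
|           "io"
|           "os"
|           "path/filepath"
|       )
|
|       // key is the bucket identity: same extension, size, and sampled CRC32.
|       type key struct {
|           ext  string
|           size int64
|           sum  uint32
|       }
|
|       // sampledCRC32 hashes only the first, middle, and last chunk of the
|       // file. Different files can collide on this, and renamed extensions
|       // (foo.jpg vs foo.jpeg) are never even compared, which are the
|       // weaknesses pointed out above.
|       func sampledCRC32(path string, size int64) (uint32, error) {
|           const chunk = 4 * 1024
|           f, err := os.Open(path)
|           if err != nil {
|               return 0, err
|           }
|           defer f.Close()
|           h := crc32.NewIEEE()
|           for _, off := range []int64{0, size / 2, size - chunk} {
|               if off < 0 {
|                   off = 0
|               }
|               if _, err := f.Seek(off, io.SeekStart); err != nil {
|                   return 0, err
|               }
|               if _, err := io.CopyN(h, f, chunk); err != nil && err != io.EOF {
|                   return 0, err
|               }
|           }
|           return h.Sum32(), nil
|       }
|
|       func main() {
|           groups := make(map[key][]string)
|           filepath.Walk(os.Args[1], func(p string, info os.FileInfo, err error) error {
|               if err != nil || !info.Mode().IsRegular() {
|                   return nil // skip unreadable entries in this sketch
|               }
|               sum, err := sampledCRC32(p, info.Size())
|               if err != nil {
|                   return nil
|               }
|               k := key{filepath.Ext(p), info.Size(), sum}
|               groups[k] = append(groups[k], p)
|               return nil
|           })
|           for _, paths := range groups {
|               if len(paths) > 1 {
|                   fmt.Println(paths) // likely duplicates, not verified byte-for-byte
|               }
|           }
|       }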
| artemisart wrote:
| See also fclones (focuses on performance, has benchmarks
| https://github.com/pkolaczk/fclones). I didn't know about
| rdfind but thought the standard was fdupes
| https://github.com/adrianlopezroche/fdupes, which is as fast
| (or slow) as rdfind according to fclones' benchmarks (and
| fclones itself is much faster).
| justinsaccount wrote:
| afaik fdupes is super slow because it checksums entire files
| in order to find duplicates. This causes a ton of unnecessary
| IO if you have a lot of size collisions.
|
| The efficient way to do things is to just read files in
| parallel and break once they diverge. Basically how `cmp`
| works.
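|
| A minimal sketch of that early-exit idea (it reads the two
| candidates in lockstep rather than truly in parallel, but it
| stops at the first chunk that differs):
|
|       // cmp-style comparison sketch: read both files in lockstep and stop
|       // at the first differing chunk, so non-duplicates cost very little IO.
|       package main
|
|       import (
|           "bytes"
|           "fmt"
|           "io"
|           "os"
|       )
|
|       func sameContent(pathA, pathB string) (bool, error) {
|           a, err := os.Open(pathA)
|           if err != nil {
|               return false, err
|           }
|           defer a.Close()
|           b, err := os.Open(pathB)
|           if err != nil {
|               return false, err
|           }
|           defer b.Close()
|
|           bufA := make([]byte, 64*1024)
|           bufB := make([]byte, 64*1024)
|           for {
|               nA, errA := io.ReadFull(a, bufA)
|               nB, errB := io.ReadFull(b, bufB)
|               if nA != nB || !bytes.Equal(bufA[:nA], bufB[:nB]) {
|                   return false, nil // diverged: stop reading immediately
|               }
|               if errA == io.EOF || errA == io.ErrUnexpectedEOF {
|                   // reached the end of A; equal only if B ended here too
|                   return errB == io.EOF || errB == io.ErrUnexpectedEOF, nil
|               }
|               if errA != nil {
|                   return false, errA
|               }
|               if errB != nil {
|                   return false, errB
|               }
|           }
|       }
|
|       func main() {
|           same, err := sameContent(os.Args[1], os.Args[2])
|           if err != nil {
|               panic(err)
|           }
|           fmt.Println("identical:", same)
|       }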
| pvaldes wrote:
| Rdfind is the logical evolution of fdupes. Not only faster
| but also more clever. This does not mean that fdupes is a
| bad tool at all, but rdfind can do things that fdupes
| can't.
|
| Example: for file-X.txt, with X = 1 to 10000, you have three
| copies of each file in:
|
| mydir/file-X.txt
|
| mydir/subdir/file-X.txt and
|
| mydir/subdir/copy/file-X.txt.
|
| Fdupes would delete random files in mydir/, mydir/subdir/
| and mydir/subdir/copy/. You would end up with the remaining
| files scattered across the whole directory tree: a mess of
| three incomplete copies.
|
| Rdfind correctly guesses that what most people want is to
| remove all files in two of the directories and keep one
| copy (files and directory tree) intact. So it wipes the
| inner subdirectories in a predictable way and keeps the
| outer directory intact. This is a terrific feature, able to
| disentangle a directory tree that has been cloned and
| nested inside the original copy without destroying it, as
| in this case:
|
| a/b/c/d/files00X.txt
|
| a/b/a/b/c/d/files00X.txt
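|
| A hedged illustration of that behavior (rdfind's actual ranking
| rules differ in detail): if every duplicate group picks its
| keeper with the same deterministic rule, for instance the
| shallowest path, then all the surviving files land in the outer
| tree and the nested clone can be removed whole.
|
|       // Sketch of a deterministic "keeper" rule for a group of duplicate
|       // paths: prefer the shallowest path, breaking ties lexicographically,
|       // so every group keeps its copy in the same (outer) tree.
|       // Illustrative only, not rdfind's actual ranking logic.
|       package main
|
|       import (
|           "fmt"
|           "sort"
|           "strings"
|       )
|
|       func chooseKeeper(group []string) (keeper string, deletable []string) {
|           sorted := append([]string(nil), group...)
|           sort.Slice(sorted, func(i, j int) bool {
|               di := strings.Count(sorted[i], "/")
|               dj := strings.Count(sorted[j], "/")
|               if di != dj {
|                   return di < dj
|               }
|               return sorted[i] < sorted[j]
|           })
|           return sorted[0], sorted[1:]
|       }
|
|       func main() {
|           keep, del := chooseKeeper([]string{
|               "a/b/a/b/c/d/files001.txt",
|               "a/b/c/d/files001.txt",
|           })
|           fmt.Println("keep:  ", keep) // a/b/c/d/files001.txt
|           fmt.Println("delete:", del)  // [a/b/a/b/c/d/files001.txt]
|       }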
| diskzero wrote:
| I think we need a lookup table of marketing-speak to real-
| world performance metrics. "Blazingly fast" has been showing up
| a lot lately.
|
| The cynical side of me wants to know what features and safety
| checks a "blazingly fast" tool has not implemented that the
| older "glacially slow" tool it is replacing ended up
| implementing after all the edge conditions were uncovered.
| code_biologist wrote:
| I've found rmlint to be another very good tool in this space:
| https://github.com/sahib/rmlint
| andmarios wrote:
| A shameless plug, but this is a simple (and probably badly
| written) tool I made many years ago to scratch an itch, and I
| still use it.
|
| It finds duplicate files and replaces them with hard links,
| saving you space. Just make sure you provide it with paths in the
| same filesystem.
|
| I originally wrote it to save some space on personal files
| (videos, photos, etc.), but it turned out to be very useful for
| tar files, docker images, websites, and more. For example, I
| maintain a tar file and a docker image with Kafka connectors
| that share many jar files; using duphard I can save hundreds of
| megabytes, or even more than a gigabyte. For a documentation
| website with many copies of the same image (let's just say some
| static generators favor this practice for maintaining multiple
| versions), I can reduce the website size by 60%+, which makes
| ssh copies, docker pulls, etc. much faster, speeding up
| deployment times.
|
| https://github.com/andmarios/duphard
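|
| The core of that trick can be sketched in a few lines; this is a
| generic illustration, not duphard's actual code. Once two paths
| are known to hold identical content on the same filesystem, one
| is replaced with a hard link to the other (via a temporary name,
| so a failed link never loses the file).
|
|       // Hard-link deduplication sketch: replace dupPath with a hard link
|       // to keepPath once they are known to be byte-identical and on the
|       // same filesystem. Generic illustration, not duphard's code.
|       package main
|
|       import (
|           "fmt"
|           "os"
|       )
|
|       func linkDuplicate(keepPath, dupPath string) error {
|           // Link to a temporary name first so a failure leaves dupPath intact.
|           tmp := dupPath + ".duplink.tmp"
|           if err := os.Link(keepPath, tmp); err != nil {
|               return err // e.g. the two paths are on different filesystems
|           }
|           if err := os.Rename(tmp, dupPath); err != nil {
|               os.Remove(tmp)
|               return err
|           }
|           return nil
|       }
|
|       func main() {
|           if err := linkDuplicate(os.Args[1], os.Args[2]); err != nil {
|               fmt.Fprintln(os.Stderr, "linking failed:", err)
|               os.Exit(1)
|           }
|       }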
| [deleted]
| matzf wrote:
| Ah yes, that's exactly what CRC32 is supposed to be used for. And
| it's even quicker if you don't compute it over the whole file,
| brilliant!
| Bostonian wrote:
| On Windows, if you download the same file more than once, you
| end up with foo.doc, "foo (1).doc", "foo (2).doc", etc. A script
| that just looked for files with such names, compared them to
| foo.doc, and deleted them if they are the same would be useful.
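|
| A hedged sketch of such a script (it assumes the copies follow
| the exact "name (N).ext" pattern in a single folder, and it
| compares full contents before deleting anything):
|
|       // Sketch: find "foo (1).doc"-style download copies next to "foo.doc",
|       // compare contents, and delete a copy only if it is byte-identical.
|       package main
|
|       import (
|           "bytes"
|           "fmt"
|           "os"
|           "path/filepath"
|           "regexp"
|       )
|
|       // Matches "foo (1).doc", "foo (2).doc", ... capturing "foo" and ".doc".
|       var copyName = regexp.MustCompile(`^(.*) \(\d+\)(\.[^.]+)$`)
|
|       func main() {
|           dir := os.Args[1] // e.g. the Downloads folder
|           entries, err := os.ReadDir(dir)
|           if err != nil {
|               panic(err)
|           }
|           for _, e := range entries {
|               m := copyName.FindStringSubmatch(e.Name())
|               if m == nil {
|                   continue
|               }
|               orig := filepath.Join(dir, m[1]+m[2])
|               dup := filepath.Join(dir, e.Name())
|               a, errA := os.ReadFile(orig)
|               b, errB := os.ReadFile(dup)
|               if errA != nil || errB != nil || !bytes.Equal(a, b) {
|                   continue // original missing or contents differ: leave it alone
|               }
|               fmt.Println("removing duplicate:", dup)
|               os.Remove(dup)
|           }
|       }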
| sumtechguy wrote:
| http://malich.ru/duplicate_searcher
|
| I have had pretty good luck with that one. I used to use
| 'duplicate commander' but I am not sure that one is out there
| anymore.
| yandrypozo wrote:
| I'm curious: why did the author read only three sections of each
| file? Is that related to how CRC32 works?
___________________________________________________________________
(page generated 2021-08-30 23:01 UTC)