markfasheh/duperemove: Tools for deduping file systems. License: GPL-2.0 (LICENSE); see also LICENSE.xxhash.

# Duperemove

Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication. When given a list of files it will hash their contents on an extent-by-extent basis and compare those hashes to each other, finding and categorizing extents that match each other. Optionally, a per-block hash can be applied for further duplication lookup. When given the -d option, duperemove will submit those extents for deduplication using the Linux kernel FIDEDUPRANGE ioctl.

Duperemove can store the hashes it computes in a 'hashfile'. If given an existing hashfile, duperemove will only compute hashes for those files which have changed since the last run. Thus you can run duperemove repeatedly on your data as it changes, without having to re-checksum unchanged data.

Duperemove can also take input from the fdupes program.

See the duperemove man page for further details about running duperemove.

# Requirements

The latest stable code can be found on the release page.

Kernel: Duperemove needs a kernel version equal to or greater than 3.13.

Libraries: Duperemove uses glib2 and sqlite3. It also uses libuuid, libmount and libblkid from util-linux.

# FAQ

Please see the FAQ section in the duperemove man page.

For bug reports and feature requests please use the GitHub issue tracker.

# Examples

Please see the examples section of the duperemove man page for a complete set of usage examples, including hashfile usage.

## A simple example, with program output

Duperemove takes a list of files and directories to scan for dedupe.
If a directory is specified, all regular files within it will be scanned. Duperemove can also be told to recursively scan directories with the '-r' switch. If '-h' is provided, duperemove will print numbers in powers of 1024 (e.g., "128K").

Assume this arbitrary layout for the following examples.

    .
    +-- dir1
    |   +-- file3
    |   +-- file4
    |   +-- subdir1
    |       +-- file5
    +-- file1
    +-- file2

This will dedupe files 'file1' and 'file2':

    duperemove -dh file1 file2

This does the same but adds any files in dir1 (file3 and file4):

    duperemove -dh file1 file2 dir1

This will dedupe exactly the same as above but will recursively walk dir1, thus adding file5:

    duperemove -dhr file1 file2 dir1/

An actual run; output will differ according to the duperemove version.

    Using 128K blocks
    Using hash: murmur3
    Using 4 threads for file hashing phase
    csum: /btrfs/file1 [1/5] (20.00%)
    csum: /btrfs/file2 [2/5] (40.00%)
    csum: /btrfs/dir1/subdir1/file5 [3/5] (60.00%)
    csum: /btrfs/dir1/file3 [4/5] (80.00%)
    csum: /btrfs/dir1/file4 [5/5] (100.00%)
    Total files: 5
    Total hashes: 80
    Loading only duplicated hashes from hashfile.
    Hashing completed. Calculating duplicate extents - this may take some time.
    Simple read and compare of file data found 3 instances of extents that might benefit from deduplication.
    Showing 2 identical extents of length 512.0K with id 0971ffa6
    Start       Filename
    512.0K      "/btrfs/file1"
    1.5M        "/btrfs/dir1/file4"
    Showing 2 identical extents of length 1.0M with id b34ffe8f
    Start       Filename
    0.0         "/btrfs/dir1/file4"
    0.0         "/btrfs/dir1/file3"
    Showing 3 identical extents of length 1.5M with id f913dceb
    Start       Filename
    0.0         "/btrfs/file2"
    0.0         "/btrfs/dir1/file3"
    0.0         "/btrfs/dir1/subdir1/file5"
    Using 4 threads for dedupe phase
    [0x147f4a0] Try to dedupe extents with id 0971ffa6
    [0x147f770] Try to dedupe extents with id b34ffe8f
    [0x147f680] Try to dedupe extents with id f913dceb
    [0x147f4a0] Dedupe 1 extents (id: 0971ffa6) with target: (512.0K, 512.0K), "/btrfs/file1"
    [0x147f770] Dedupe 1 extents (id: b34ffe8f) with target: (0.0, 1.0M), "/btrfs/dir1/file4"
    [0x147f680] Dedupe 2 extents (id: f913dceb) with target: (0.0, 1.5M), "/btrfs/file2"
    Kernel processed data (excludes target files): 4.5M
    Comparison of extent info shows a net change in shared extents of: 5.5M

# Links of interest

* The duperemove wiki has both design and performance documentation.
* duperemove-tests has a growing assortment of regression tests.
* Duperemove web page
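
The hash-and-compare idea described in the overview can be illustrated with a small C sketch. This is not duperemove's implementation: it works on an in-memory buffer rather than on file extents, uses fixed-size blocks, and substitutes FNV-1a for the hash functions duperemove actually uses. The names `block_hash`, `fnv1a64` and `count_duplicate_blocks` are hypothetical and exist only for this example. Blocks are hashed, sorted by digest so that candidates become adjacent, and confirmed with a byte comparison before being counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One hashed block (hypothetical type, for illustration only). */
struct block_hash {
    size_t offset;    /* byte offset of the block within the buffer */
    uint64_t digest;  /* hash of the block's contents */
};

/* 64-bit FNV-1a: a stand-in hash, NOT what duperemove uses. */
static uint64_t fnv1a64(const unsigned char *data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Order by digest first, then by offset, so equal digests are adjacent. */
static int cmp_block_hash(const void *a, const void *b)
{
    const struct block_hash *x = a;
    const struct block_hash *y = b;
    if (x->digest != y->digest)
        return x->digest < y->digest ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/*
 * Count the blocks in buf whose contents repeat an earlier block.
 * Only full blocks are considered; a trailing partial block is ignored.
 * Returns (size_t)-1 if memory allocation fails.
 */
size_t count_duplicate_blocks(const unsigned char *buf, size_t len,
                              size_t block_size)
{
    assert(block_size > 0);
    size_t n = len / block_size;
    if (n == 0)
        return 0;

    struct block_hash *blocks = malloc(n * sizeof(*blocks));
    if (!blocks)
        return (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        blocks[i].offset = i * block_size;
        blocks[i].digest = fnv1a64(buf + blocks[i].offset, block_size);
    }
    qsort(blocks, n, sizeof(*blocks), cmp_block_hash);

    size_t dupes = 0;
    for (size_t i = 1; i < n; i++) {
        /* Walk back over earlier entries that share this digest and
         * confirm a match byte for byte, so hash collisions are not
         * mistaken for duplicates. */
        for (size_t j = i; j > 0 && blocks[j - 1].digest == blocks[i].digest; j--) {
            if (memcmp(buf + blocks[j - 1].offset,
                       buf + blocks[i].offset, block_size) == 0) {
                dupes++;
                break;
            }
        }
    }
    free(blocks);
    return dupes;
}
```

Called on the 20-byte buffer `"AAAABBBBAAAACCCCBBBB"` with a block size of 4, `count_duplicate_blocks` reports two duplicate blocks (the second `AAAA` and the second `BBBB`). The sketch stops at detection; it does not submit anything to the kernel for deduplication.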