# Wayback Machine Downloader

Download an entire website from the Internet Archive Wayback Machine.
## Installation

You need Ruby (>= 1.9.2) installed on your system, if you don't already have it. Then run:

```
gem install wayback_machine_downloader
```

Tip: If you run into permission errors, you might have to add `sudo` in front of this command.

## Basic Usage

Run wayback_machine_downloader with the base url of the website you want to retrieve as a parameter (e.g., http://example.com):

```
wayback_machine_downloader http://example.com
```

## How it works

It will download the last version of every file present on the Wayback Machine to `./websites/example.com/`. It will also re-create the directory structure and auto-create `index.html` pages so the result works seamlessly with Apache and Nginx. All downloaded files are the originals, not the Wayback Machine's rewritten versions, so URLs and link structure are the same as before.

## Advanced Usage

```
Usage: wayback_machine_downloader http://example.com

Download an entire website from the Wayback Machine.

Optional options:
    -d, --directory PATH           Directory to save the downloaded files into
                                   Default is ./websites/ plus the domain name
    -s, --all-timestamps           Download all snapshots/timestamps for a given website
    -f, --from TIMESTAMP           Only files on or after timestamp supplied (ie. 20060716231334)
    -t, --to TIMESTAMP             Only files on or before timestamp supplied (ie. 20100916231334)
    -e, --exact-url                Download only the url provided and not the full site
    -o, --only ONLY_FILTER         Restrict downloading to urls that match this filter
                                   (use // notation for the filter to be treated as a regex)
    -x, --exclude EXCLUDE_FILTER   Skip downloading of urls that match this filter
                                   (use // notation for the filter to be treated as a regex)
    -a, --all                      Expand downloading to error files (40x and 50x) and redirections (30x)
    -c, --concurrency NUMBER       Number of multiple files to download at a time
                                   Default is one file at a time (ie. 20)
    -p, --maximum-snapshot NUMBER  Maximum snapshot pages to consider (Default is 100)
                                   Count an average of 150,000 snapshots per page
    -l, --list                     Only list file urls in a JSON format with the archived timestamps,
                                   won't download anything
```

### Specify directory to save files to

```
-d, --directory PATH
```

Optional. By default, Wayback Machine Downloader will download files to `./websites/` followed by the domain name of the website. You may want to save files in a specific directory using this option.

Example:

```
wayback_machine_downloader http://example.com --directory downloaded-backup/
```

### All Timestamps

```
-s, --all-timestamps
```

Optional. This option will download all timestamps/snapshots for a given website, using the timestamp of each snapshot as a directory name.

Example:

```
wayback_machine_downloader http://example.com --all-timestamps
```

Will download:

```
websites/example.com/20060715085250/index.html
websites/example.com/20051120005053/index.html
websites/example.com/20060111095815/img/logo.png
...
```
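Because each snapshot ends up in its own timestamped directory, two captures can then be compared with ordinary tools. A minimal sketch using the example paths above (the timestamps are only illustrative):

```
diff -r websites/example.com/20051120005053 websites/example.com/20060715085250
```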
### From Timestamp

```
-f, --from TIMESTAMP
```

Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., https://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination with To Timestamp. Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.

Example:

```
wayback_machine_downloader http://example.com --from 20060716231334
```

### To Timestamp

```
-t, --to TIMESTAMP
```

Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., https://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination with From Timestamp. Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.

Example:

```
wayback_machine_downloader http://example.com --to 20100916231334
```

### Exact Url

```
-e, --exact-url
```

Optional. If you want to retrieve only the file matching exactly the url provided, you can use this flag. It will avoid downloading anything else. For example, if you only want to download the html homepage file of example.com:

```
wayback_machine_downloader http://example.com --exact-url
```

### Only URL Filter

```
-o, --only ONLY_FILTER
```

Optional. You may want to retrieve only files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, you can supply the --only flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you only want to download files inside a specific my_directory:

```
wayback_machine_downloader http://example.com --only my_directory
```

Or if you want to download only images and nothing else:

```
wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i"
```

### Exclude URL Filter

```
-x, --exclude EXCLUDE_FILTER
```

Optional. You may want to skip files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, you can supply the --exclude flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you want to avoid downloading files inside my_directory:

```
wayback_machine_downloader http://example.com --exclude my_directory
```

Or if you want to download everything except images:

```
wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
```

### Expand downloading to all file types

```
-a, --all
```

Optional. By default, Wayback Machine Downloader limits itself to files that responded with a 200 OK code. If you also need error files (40x and 50x codes) or redirection files (30x codes), you can use the --all or -a flag and Wayback Machine Downloader will download them in addition to the 200 OK files. It will also keep empty files that are removed by default.

Example:

```
wayback_machine_downloader http://example.com --all
```

### Only list files without downloading

```
-l, --list
```

It will just display the files to be downloaded with their snapshot timestamps and urls. The output format is JSON. It won't download anything. It's useful for debugging or for feeding another application.

Example:

```
wayback_machine_downloader http://example.com --list
```

### Maximum number of snapshot pages to consider

```
-p, --snapshot-pages NUMBER
```

Optional. Specify the maximum number of snapshot pages to consider. Count an average of 150,000 snapshots per page. 100 is the default maximum number of snapshot pages and should be sufficient for most websites. Use a bigger number if you want to download a very large website.

Example:

```
wayback_machine_downloader http://example.com --snapshot-pages 300
```
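The filter and timestamp options above can be combined in a single command. For instance, a purely illustrative sketch that restricts a backup to images captured between 2006 and 2010 and saves it into a custom directory:

```
wayback_machine_downloader http://example.com --from 2006 --to 2010 --only "/\.(gif|jpg|jpeg)$/i" --directory example-backup/
```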
### Download multiple files at a time

```
-c, --concurrency NUMBER
```

Optional. Specify the number of files you want to download at the same time. This can speed up the download of a website significantly. Default is to download one file at a time.

Example:

```
wayback_machine_downloader http://example.com --concurrency 20
```

## Using the Docker image

As an alternative installation method, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way:

```
docker pull hartator/wayback-machine-downloader
```

Then, you should be able to use the Docker image to download websites. For example:

```
docker run --rm -it -v $PWD/websites:/websites hartator/wayback-machine-downloader http://example.com
```

## Contributing

Contributions are welcome! Just submit a pull request via GitHub.

To run the tests:

```
bundle install
bundle exec rake test
```
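Relatedly, the command-line tool is a thin wrapper around the `WaybackMachineDownloader` class shipped with the gem, so it can also be driven from Ruby code. Below is a minimal sketch; the option keys shown (`base_url`, `from_timestamp`, `only_filter`, `threads_count`) are assumptions inferred from the CLI flags above, so check `lib/wayback_machine_downloader.rb` for the exact names before relying on them.

```ruby
require 'wayback_machine_downloader'

# Mirror example.com as of mid-2006, images only, four downloads in parallel.
# NOTE: the option keys below are assumptions inferred from the CLI flags;
# verify them against lib/wayback_machine_downloader.rb.
downloader = WaybackMachineDownloader.new(
  base_url:       'http://example.com',
  from_timestamp: '20060716231334',
  only_filter:    '/\.(gif|jpg|jpeg)$/i',
  threads_count:  4
)

downloader.download_files
```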