[HN Gopher] Wayback Machine Downloader
___________________________________________________________________
Wayback Machine Downloader
Author : pabs3
Score : 182 points
Date : 2021-07-12 05:35 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| [deleted]
| _def wrote:
| > Tip: If you run into permission errors, you might have to add
| sudo in front of this command.
|
| That is not a tip, it's a dirty workaround.
| mdaniel wrote:
| And also a fantastic source of Stack Overflow questions when
| trying to troubleshoot why some old dependency starts showing
| up "randomly"
|
| Heh, I just realized SO is a potential source of energy if we
| could harness the copy-paste -> SO question -> copy-paste
| cycle. Blockchain! :-P
| hartator wrote:
| > it's a dirty workaround
|
| Yeah, but at the same time, if an attacker already has
| non-sudo access to a machine, everything interesting is most
| likely already compromised. Sudo does seem like a hardly
| justifiable complication in most cases.
| tams wrote:
| I'm pretty fond of using this tool to take trips down memory
| lane, revisiting lost content I used to enjoy.
|
| Browsing through crawls has the neat side effect of letting me
| serendipitously discover things I missed back in the day, just
| by having everything laid out on the file system.
|
| PSA: There are a lot of holes in most crawls, even for popular
| stuff. A good way to ensure that you can revisit content later
| is to submit links to the Wayback Machine with the "Save Page
| Now" [1] functionality. Some local archivers like ArchiveBox
| [2] let you automate this. I highly recommend making a habit
| of it.
|
| [1] https://web.archive.org/
|
| [2] https://github.com/ArchiveBox/ArchiveBox
| pabs3 wrote:
| Another convenient way to interact with "Save Page Now" is just
| to email a bunch of links to the savepagenow address at
| archive.org. I especially like to copy all the HTML of a page
| and paste it into an HTML email to get all the links.
| toomuchtodo wrote:
| You can also kick off retrievals from the command line:
|
| https://github.com/pastpages/savepagenow
|
| https://github.com/overcast07/wayback-machine-spn-scripts
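|
| For instance, a minimal sketch of the pastpages library's
| Python API (the target URL is just a placeholder; see its
| README for the full interface):
|
|     # pip install savepagenow
|     import savepagenow
|
|     # Ask "Save Page Now" to capture the page and return the
|     # resulting snapshot URL.
|     archive_url = savepagenow.capture("http://example.com/")
|     print(archive_url)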
| pabs3 wrote:
| Yeah, I use that API from the browser, but I found the bulk,
| asynchronous, zero-download email approach more convenient,
| since for a while the save API stopped supporting HEAD
| requests (although it seems to support them again now).
| cxr wrote:
| There are two things to note, neither of which is well
| advertised:
|
| 1. The parent comment you're replying to links to the main
| page for the Wayback Machine, which includes a Save Page Now
| widget, but Save Page Now actually has a dedicated page
| <https://web.archive.org/save/>
|
| 2. If you have an archive.org account (lets you submit and
| comment on collections; the library is bigger than just the
| Wayback Machine) and you visit the Save Page Now page while
| logged in, you get more options, including the option "Save
| outlinks"
| teddyh wrote:
| > _Default is one file at a time (ie. 20)_
|
| I don't understand this. Did they mean "e.g." instead of "i.e."?
| jolmg wrote:
| Even with e.g. it doesn't make much sense.
| squarefoot wrote:
| It would be handy to include an option to walk back in time
| without starting to download until a certain site size is met,
| to exclude 404s, domain-for-sale placeholders, etc., which
| aren't uncommon among old sites, so that Archive.org's
| precious bandwidth isn't wasted.
| pabs3 wrote:
| It already has an option to include/exclude 404s. Please file
| issues for the other requests, except the filter for domains
| for sale; that is a hard problem that probably isn't in scope
| for this tool.
| tempestn wrote:
| The parent's suggestion to filter on size seems on the
| surface like it would work. What makes you think it's harder
| than that?
| pabs3 wrote:
| I'm not sure how filtering based on size relates to
| filtering out domains for sale, spam domains, etc. Lots of
| legit domains are small and lots are large, and spam
| domains similarly vary in size.
|
| Filtering out unwanted domains (for sale, spam, etc.) is a
| problem for a bunch of regexes, Bayesian classification, or
| machine learning.
|
| Edit: I think I misread the original post quite badly and I
| don't understand the proposed feature.
| squarefoot wrote:
| My bad, I didn't specify what sizes I was referring to; I
| meant the page size, including graphics, number of links
| etc. If there's a way to extract a rough estimate of the
| page "weight", it could be used to filter out empty (as
| in clearly expired) pages without downloading them. I'm
| not a web dev, so I'm not sure if that is possible.
| EricE wrote:
| Unfortunately, a lot of ad-farm sites that hoover up
| abandoned/popular domains scrape content from other sites
| and aggregate it to keep Google driving clicks to their
| ad-laced pages.
|
| On the surface, a Yelp-like system to rate domains as
| legit vs. clickbait seems logical, until you realize the
| scammers would just work at gaming that system too :p
|
| This is where a universal ID would really help - but the
| other ways something like that could be used make me even
| more uncomfortable so here we are with no real good
| solution :(
| pabs3 wrote:
| I think the best way to provide what you want would be
| pre-download and post-download hooks, so you could write
| some code to skip certain downloads and to detect whether
| downloaded files are spammy. Then you could write your
| size heuristic as a plugin, and others could use regexes
| or machine learning to do something similar.
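|
| Purely as an illustration of that idea (hypothetical hook
| names, sketched in Python rather than the tool's actual Ruby
| code), a size-based plugin might look roughly like:
|
|     # Hypothetical pre/post-download hooks; these names and
|     # argument shapes are not part of the real tool.
|     MIN_BYTES = 2048  # crude "empty placeholder page" cutoff
|
|     def pre_download(snapshot: dict) -> bool:
|         # Skip snapshots the CDX index already lists as tiny.
|         return int(snapshot.get("length", 0)) >= MIN_BYTES
|
|     def post_download(path: str, content: bytes) -> bool:
|         # Reject obvious domain-for-sale placeholder pages.
|         return b"domain is for sale" not in content.lower()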
| beermonster wrote:
| What's the advantage over, say, using something like wget
| --mirror?
| betamaxthetape wrote:
| This tool appears to be using the Wayback Machine's CDX index
| to assist with the download.
|
| The CDX basically lists all the pages of the site that are
| archived in the wayback, as well as when each page was archived
| (e.g. "page x was archived 3 times, on these dates"). Using the
| CDX allows the tool to download a specific copy of the site
| (e.g. the latest) rather than trying to download every copy of
| the site that the wayback machine has.
|
| This is important because, for most sites, the wayback has
| multiple copies, and they're all interlinked. For example, the
| copy from May 2020 might not be complete, so one of the links
| in that copy will take you to the January 2018 copy. Not a
| problem for a human viewer, but a bot / crawler will see pages
| in the January 2018 copy as separate from those in the May 2020
| copy, so it will begin downloading the January 2018 copy
| (because wayback URLs are of the form
| web.archive.org/web/<timestamp>/<archived-url> rather than
| web.archive.org/<archived-url>/<timestamp>). That copy will
| (inevitably) lead to other copies made on different dates, and
| before you know it you're downloading hundreds or even
| thousands of copies of the same site.
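|
| To make that concrete, here is a rough sketch of asking the
| CDX index when a single page was captured (the field list and
| limit are chosen just for illustration):
|
|     # List captures of one page via the Wayback CDX API.
|     import json
|     from urllib.request import urlopen
|
|     url = ("https://web.archive.org/cdx/search/cdx"
|            "?url=example.com/&output=json"
|            "&fl=timestamp,statuscode&limit=5")
|     rows = json.load(urlopen(url))
|     # First row is the header ["timestamp", "statuscode"];
|     # each later row is one capture of that exact URL.
|     for timestamp, status in rows[1:]:
|         print(timestamp, status)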
|
| [source: tried to download a site from the wayback machine
| several years ago using wget - it didn't end well!]
| pabs3 wrote:
| It can also download from every copy of the site that the
| wayback machine has; sometimes that is useful.
| pwdisswordfish8 wrote:
| It doesn't add the Wayback Machine's navigation bar, presumably?
| pabs3 wrote:
| Correct, it appends "id_" to the timestamp in the wayback
| URLs, which gives you the unmodified file instead of one
| marked up by the Wayback Machine.
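|
| For example (with a made-up timestamp):
|
|     https://web.archive.org/web/20210101000000/http://example.com/
|       -> rewritten page, with the Wayback toolbar injected
|     https://web.archive.org/web/20210101000000id_/http://example.com/
|       -> the original file as crawled, nothing added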
| yepthatsreality wrote:
| I will be using this to restore a frequently disappearing
| webcomic called Newspaper Comic Strip. About a man who realizes
| he's a Newspaper Comic Strip.[0]
|
| [0]
| http://web.archive.org/web/20210119093354/http://riotfish.co...
| teddyh wrote:
| If you like that, these might also be interesting:
|
| http://oneoverzero.keenspace.com/d/20000827.html
|
| https://en.wikipedia.org/wiki/Sam%27s_Strip
| jpswade wrote:
| I had a look into this as part of a research project.
|
| To reduce the burden of writing specific scraping software, I
| investigated the software listed by the Archive Team[0].
|
| The command line tools and libraries aside (as they would require
| much more specific tailoring to make them work), I was
| particularly interested in HTTrack and Warrick.
|
| Warrick[1] is defined as a tool to recover lost websites using
| various online archives and caches. Warrick[2] requires some
| expert knowledge in order to get it up and running.
|
| I found Warrick was a bit outdated, so decided to try something
| similar but more up to date, and came across this
| (hartator/wayback_machine_downloader).
|
| I found this to be a bit easier to work with as it would allow me
| to download snapshots within a time period, which is what I
| needed for this project.
|
| After running for ~12 hours on my local machine, it still had
| not completed, having downloaded only 11768 of 94518 files.
|
| Instead, I found myself writing a Python-based tool that could
| fetch with much more accuracy using the CDX server[3],
| filtering by date, targeting only certain files, and allowing
| for multithreading.
|
| In order to improve the process and narrow down the data to
| what was needed, I scoped it to just the July and December
| timeframes of each year, from 2010 to this year, targeting
| only HTML files.
|
| For example:
| http://web.archive.org/cdx/search/cdx?url=%s&matchType=prefi...
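|
| A rough sketch of that kind of query (the site and date window
| below are placeholders, not the ones from the project):
|
|     # Ask the CDX server for HTML captures in a date window.
|     import json
|     from urllib.parse import urlencode
|     from urllib.request import urlopen
|
|     params = urlencode({
|         "url": "example.com/",        # placeholder site
|         "matchType": "prefix",        # everything under it
|         "from": "20100701",           # window start
|         "to": "20100731",             # window end
|         "filter": "mimetype:text/html",
|         "collapse": "urlkey",         # one capture per URL
|         "output": "json",
|     })
|     rows = json.load(urlopen(
|         "http://web.archive.org/cdx/search/cdx?" + params))
|     header, captures = rows[0], rows[1:]
|     print(len(captures), "captures; fields:", header)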
|
| Hopefully someone finds this useful.
|
| [0] https://archiveteam.org/index.php?title=Software
|
| [1] https://github.com/oduwsdl/warrick
|
| [2]
| https://code.google.com/archive/p/warrick/wikis/About_Warric...
|
| [3]
| https://github.com/internetarchive/wayback/tree/master/wayba...
| pabs3 wrote:
| wayback-machine-downloader uses the CDX API too btw. IIRC it
| has some rate limiting to avoid overloading the Wayback server.
| Causality1 wrote:
| I would love a successor to HTTrack. It worked wonderfully
| back in the day, but modern websites have so much media and
| dynamic content that you barely get anything useful.
| mendaxg wrote:
| Lol, just a crazy thought: can we use this tool to download
| archives and create a replica of existing/old sites?
| pabs3 wrote:
| Definitely, that is basically what it is for.
| huxflux wrote:
| Finally a free version; people have been ripping others off
| with paid services for years (I say ripped off because they
| specifically used open source projects without attribution).
| pabs3 wrote:
| Which paid services are you referring to? It is likely that
| these services aren't distributing the projects they are based
| on; if so, they are in compliance with the licenses of those
| open source projects, which probably don't require attribution
| unless you distribute them.
|
| This project started in 2015 btw. Another similar project
| called waybackpack started in 2016. There are probably more
| projects. IMO wayback-machine-downloader is the better project
| though.
|
| https://github.com/jsvine/waybackpack
|
| The Wayback CDX Server API these projects are based on is quite
| simple to use btw, just some JSON responses to decode.
|
| https://archive.org/help/wayback_api.php
| https://github.com/internetarchive/wayback/blob/master/wayba...
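|
| For instance, with output=json the whole response is just a
| JSON array whose first row names the fields (a quick sketch):
|
|     import json
|     from urllib.request import urlopen
|
|     rows = json.load(urlopen(
|         "https://web.archive.org/cdx/search/cdx"
|         "?url=archive.org/&output=json&limit=2"))
|     print(rows[0])  # urlkey, timestamp, original, mimetype, ...
|     print(rows[1])  # one capture of that URL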
| hartator wrote:
| I made this!
|
| I had an old website of mine (my old video game portfolio)
| that I wanted to bring back to life. I had no sources and no
| backups, but it was still on the Wayback Machine! I first
| wrote a quick wrapper in Ruby and it worked fine. I then
| decided to open source it and publish it. It has been a fun
| adventure to see this being used by so many! <3
| simonw wrote:
| Thanks very much for this! I used it a few years ago to recover
| some lost content from my blog:
| https://simonwillison.net/2017/Oct/8/missing-content/
| albatross13 wrote:
| Kudos for taking the time to write up a nice, clean, and
| concise readme. I don't know about everyone, but this makes all
| the difference in the world to me.
___________________________________________________________________
(page generated 2021-07-12 23:02 UTC)