[HN Gopher] Wayback Machine Downloader
       ___________________________________________________________________
        
       Wayback Machine Downloader
        
       Author : pabs3
       Score  : 182 points
       Date   : 2021-07-12 05:35 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | [deleted]
        
       | _def wrote:
       | > Tip: If you run into permission errors, you might have to add
       | sudo in front of this command.
       | 
       | That is not a tip, it's a dirty workaround.
        
         | mdaniel wrote:
         | And also a fantastic source of Stack Overflow questions when
         | trying to troubleshoot why some old dependency starts showing
         | up "randomly"
         | 
          | Heh, I just realized SO is a potential source of energy if we
          | could harness the copy-paste-to-SO-question-to-copy-paste
          | cycle. Blockchain! :-P
        
         | hartator wrote:
         | > it's a dirty workaround
         | 
          | Yeah, but at the same time, if an attacker has non-sudo
          | access to a machine, everything interesting is most likely
          | already compromised. Sudo does seem like a hardly justifiable
          | complication in most cases.
        
       | tams wrote:
       | I'm pretty fond of using this tool to take trips down memory
       | lane, revisiting lost content I used to enjoy.
       | 
       | Browsing through crawls has this neat side-effect of being able
       | to serendipitously discover things that I missed back in the day
       | just by having everything laid out on the file system.
       | 
        | PSA: There are a lot of holes in most crawls, even for popular
       | stuff. A good way to ensure that you can revisit content later is
       | submitting links to the Wayback Machine with the "Save Page Now"
       | [1] functionality. Some local archivers like ArchiveBox [2] let
       | you automate this. Highly recommended to make a habit of it.
       | 
       | [1] https://web.archive.org/
       | 
       | [2] https://github.com/ArchiveBox/ArchiveBox
        
         | pabs3 wrote:
         | Another convenient way to interact with "Save Page Now" is just
         | to email a bunch of links to the savepagenow address at
          | archive.org. I especially like to copy all the HTML of a page
          | and paste it into an HTML email to get all the links.
        
           | toomuchtodo wrote:
           | You can also kick off retrievals from the command line:
           | 
           | https://github.com/pastpages/savepagenow
           | 
           | https://github.com/overcast07/wayback-machine-spn-scripts
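            | 
            | For example, a minimal sketch using the pastpages library
            | above (assuming its capture() helper; check its README for
            | the current API):
            | 
            |     # pip install savepagenow  (assumed API; see its README)
            |     import savepagenow
            | 
            |     # Ask Save Page Now to archive a URL; returns the
            |     # resulting web.archive.org snapshot URL.
            |     archived = savepagenow.capture("http://example.com/")
            |     print(archived)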
        
             | pabs3 wrote:
              | Yeah, I use that API from the browser. I found the bulk,
              | asynchronous, zero-download email API more convenient,
              | since for a while the save API stopped supporting HEAD
              | requests, although it seems to support them again now.
        
           | cxr wrote:
            | There are two things to note, neither of which is well-
            | advertised:
           | 
           | 1. The parent comment you're replying to links to the main
           | page for the Wayback Machine, which includes a Save Page Now
           | widget, but Save Page Now actually has a dedicated page
           | <https://web.archive.org/save/>
           | 
           | 2. If you have an archive.org account (lets you submit and
           | comment on collections; the library is bigger than just the
           | Wayback Machine) and you visit the Save Page Now page while
           | logged in, you get more options, including the option "Save
           | outlinks"
        
       | teddyh wrote:
       | > _Default is one file at a time (ie. 20)_
       | 
       | I don't understand this. Did they mean "e.g." instead of "i.e."?
        
         | jolmg wrote:
         | Even with e.g. it doesn't make much sense.
        
       | squarefoot wrote:
        | It would be handy to include an option to walk back in time
        | without starting the download until a certain site size is met,
        | to exclude 404s, domain-for-sale placeholders etc., which aren't
        | uncommon among old sites, so that Archive.org's precious
        | bandwidth isn't wasted.
        
         | pabs3 wrote:
          | It already has an option to include/exclude 404s. Please file
          | issues for the other requests, except the filter for domains
          | for sale; that is a hard problem that probably isn't in scope
          | for this tool.
        
           | tempestn wrote:
           | The parent's suggestion to filter on size seems on the
           | surface like it would work. What makes you think it's harder
           | than that?
        
             | pabs3 wrote:
             | I'm not sure how filtering based on size relates to
             | filtering out domains for sale, spam domains etc. Lots of
              | legit domains are small and lots are large, and spam
              | domains similarly vary in size.
             | 
             | Filtering out unwanted domains (sale, spam etc) is a
             | problem for a bunch of regexes, bayesian classification or
             | machine learning.
             | 
             | Edit: I think I misread the original post quite badly and I
             | don't understand the proposed feature.
        
               | squarefoot wrote:
               | My bad, I didn't specify what sizes I was referring to; I
               | meant the page size, including graphics, number of links
               | etc. If there's a way to extract a rough estimate of the
               | page "weight", it could be used to filter out empty (as
               | in clearly expired) pages without downloading them. I'm
               | not a web dev, so I'm not sure if that is possible.
        
               | EricE wrote:
                | Unfortunately a lot of ad-farm sites that hoover up
                | abandoned/popular domains scrape content from other sites
                | and aggregate it to keep Google driving clicks to their
                | ad-laced pages.
               | 
                | On the surface a Yelp-like system to rate domains as
                | legit vs. clickbait seems logical, until you realize the
                | scammers would just work at gaming that system too :p
               | 
               | This is where a universal ID would really help - but the
               | other ways something like that could be used make me even
                | more uncomfortable, so here we are with no really good
               | solution :(
        
               | pabs3 wrote:
                | I think the best way to provide what you want would be to
                | add pre-download and post-download hooks, so you could
                | write some code to prevent certain downloads and detect
                | whether downloaded files are spammy. Then you could write
                | your size heuristic as a plugin, and others could use
                | regexes or machine learning to do something similar.
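                | 
                | A hypothetical sketch of what such hooks might look
                | like (none of this exists in the tool today; the names
                | and thresholds are made up):
                | 
                |     import re
                | 
                |     # Hypothetical hook API, not part of the tool.
                |     def pre_download_hook(url, timestamp, length):
                |         # Skip tiny captures, which are often parked
                |         # or "for sale" placeholder pages.
                |         return length is None or length >= 1024
                | 
                |     def post_download_hook(path, body):
                |         # Discard files with obvious for-sale text.
                |         pattern = r"this domain (is|may be) for sale"
                |         return not re.search(pattern, body, re.I)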
        
       | beermonster wrote:
        | What's the advantage over, say, using something like wget
        | --mirror?
        
         | betamaxthetape wrote:
         | This tool appears to be using the Wayback Machine's CDX index
         | to assist with the download.
         | 
         | The CDX basically lists all the pages of the site that are
         | archived in the wayback, as well as when each page was archived
         | (e.g: "page x was archived 3 times, on these dates"). Using the
         | CDX allows the tool to download a specific copy of the site
         | (e.g: the latest) rather than trying to download every copy of
         | the site that the wayback machine has.
         | 
         | This is important because for most sites, the wayback has
         | multiple copies, and they're all interlinking. For example, the
         | copy from May 2020 might not be complete so one of the links in
         | that copy will take you to the January 2018 copy. Not a problem
         | for a human viewer, but a bot / crawler will see pages in the
         | January 2018 copy as separate from those in the May 2020 copy,
         | so will begin downloading the January 2018 copy (because
         | wayback URLs are of the form
         | web.archive.org/<timestamp>/<archive-url> rather than
         | web.archive.org/<archive-url>/<timestamp>). This copy will
         | (inevitably) lead to other copies made at different dates, and
         | before you know it you're downloading hundreds or even
         | thousands of copies of the same site.
         | 
         | [source: tried to download a site from the wayback machine
         | several years ago using wget - it didn't end well!]
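          | 
          | For illustration, a minimal sketch of that kind of CDX query
          | in Python (example.com is a placeholder; parameters as per
          | the CDX server docs):
          | 
          |     import requests
          | 
          |     resp = requests.get(
          |         "http://web.archive.org/cdx/search/cdx",
          |         params={
          |             "url": "example.com/*",
          |             "output": "json",
          |             "fl": "timestamp,original",
          |             "filter": "statuscode:200",
          |             "collapse": "urlkey",  # one capture per URL
          |         },
          |     )
          |     rows = resp.json()
          |     for timestamp, original in rows[1:]:  # row 0 is header
          |         print(timestamp, original)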
        
           | pabs3 wrote:
            | It can also download from every copy of the site that the
            | Wayback Machine has; sometimes that is useful.
        
         | pwdisswordfish8 wrote:
          | It doesn't add the Wayback Machine's navigation bar,
          | presumably?
        
           | pabs3 wrote:
           | Correct, it appends "id_" to the timestamp in the wayback
           | URLs, which gives you the unmodified file instead of one
           | marked up by the Wayback Machine.
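            | 
            | So instead of the marked-up capture at, say,
            | 
            |     http://web.archive.org/web/20210101000000/http://example.com/
            | 
            | it fetches the raw one at
            | 
            |     http://web.archive.org/web/20210101000000id_/http://example.com/
            | 
            | (the timestamp and domain above are placeholders).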
        
       | yepthatsreality wrote:
       | I will be using this to restore a frequently disappearing
       | webcomic called Newspaper Comic Strip. About a man who realizes
       | he's a Newspaper Comic Strip.[0]
       | 
       | [0]
       | http://web.archive.org/web/20210119093354/http://riotfish.co...
        
         | teddyh wrote:
         | If you like that, these might also be interesting:
         | 
         | http://oneoverzero.keenspace.com/d/20000827.html
         | 
         | https://en.wikipedia.org/wiki/Sam%27s_Strip
        
       | jpswade wrote:
       | I had a look into this as part of a research project.
       | 
       | To reduce the burden of writing specific scraping software, I
       | investigated the software listed by the Archive Team[0].
       | 
       | The command line tools and libraries aside (as they would require
       | much more specific tailoring to make them work), I was
       | particularly interested in HTTrack and Warrick.
       | 
       | Warrick[1] is defined as a tool to recover lost websites using
       | various online archives and caches. Warrick[2] requires some
       | expert knowledge in order to get it up and running.
       | 
       | I found Warrick was a bit outdated, so decided to try something
       | similar but more up to date, and came across this
       | (hartator/wayback_machine_downloader).
       | 
       | I found this to be a bit easier to work with as it would allow me
       | to download snapshots within a time period, which is what I
       | needed for this project.
       | 
       | After running it for ~12 hours on my local machine, it still had
       | not completed, downloading only 11768 of 94518 files.
       | 
        | Instead, I found myself writing a Python-based tool that could
        | fetch with much more accuracy using the CDX server[3], filtering
        | by date, targeting only certain files and allowing for multi-
        | threading (a rough sketch follows the links below).
       | 
        | In order to improve the process and narrow down the data to what
        | is needed, I scoped it to just the July and December timeframes
        | per year, from 2010 to this year, only targeting HTML files.
       | 
       | For example:
       | http://web.archive.org/cdx/search/cdx?url=%s&matchType=prefi...
       | 
       | Hopefully someone finds this useful.
       | 
       | [0] https://archiveteam.org/index.php?title=Software
       | 
       | [1] https://github.com/oduwsdl/warrick
       | 
       | [2]
       | https://code.google.com/archive/p/warrick/wikis/About_Warric...
       | 
       | [3]
       | https://github.com/internetarchive/wayback/tree/master/wayba...
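        | 
        | For anyone curious, a rough sketch of that kind of script
        | (placeholder domain and date window; a real version would add
        | retries and rate limiting):
        | 
        |     import requests
        |     from concurrent.futures import ThreadPoolExecutor
        | 
        |     CDX = "http://web.archive.org/cdx/search/cdx"
        | 
        |     def list_captures(domain, start, end):
        |         # One HTML capture per unique URL in the date window.
        |         params = {
        |             "url": domain, "matchType": "prefix",
        |             "from": start, "to": end,
        |             "filter": "mimetype:text/html",
        |             "collapse": "urlkey",
        |             "fl": "timestamp,original", "output": "json",
        |         }
        |         return requests.get(CDX, params=params).json()[1:]
        | 
        |     def fetch(capture):
        |         timestamp, original = capture
        |         # "id_" returns the raw capture, without the toolbar.
        |         url = f"http://web.archive.org/web/{timestamp}id_/{original}"
        |         return original, requests.get(url).text
        | 
        |     captures = list_captures("example.com", "20100701", "20100731")
        |     with ThreadPoolExecutor(max_workers=4) as pool:
        |         for original, html in pool.map(fetch, captures):
        |             print(original, len(html))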
        
         | pabs3 wrote:
         | wayback-machine-downloader uses the CDX API too btw. IIRC it
         | has some rate limiting to avoid overloading the Wayback server.
        
         | Causality1 wrote:
          | I would love a successor to HTTrack. It worked wonderfully back
         | in the day but modern websites have so much media and dynamic
         | content you barely get anything useful.
        
       | mendaxg wrote:
        | Lol, just a crazy thought: can we use this tool to download
        | archives and create a replica of existing/old sites?
        
         | pabs3 wrote:
         | Definitely, that is basically what it is for.
        
       | huxflux wrote:
        | Finally a free version. People have ripped others off with paid
        | services for years (I say ripped off because they in particular
        | used open source projects without attribution).
        
         | pabs3 wrote:
         | Which paid services are you referring to? It is likely that
         | these services aren't distributing the projects they are based
          | on; if so, then they are in compliance with the licenses of the
         | open source projects, which probably don't require attribution
         | unless you distribute them.
         | 
         | This project started in 2015 btw. Another similar project
         | called waybackpack started in 2016. There are probably more
         | projects. IMO wayback-machine-downloader is the better project
         | though.
         | 
         | https://github.com/jsvine/waybackpack
         | 
         | The Wayback CDX Server API these projects are based on is quite
         | simple to use btw, just some JSON responses to decode.
         | 
         | https://archive.org/help/wayback_api.php
         | https://github.com/internetarchive/wayback/blob/master/wayba...
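          | 
          | The first link above also covers the even simpler availability
          | endpoint; a minimal sketch (example.com is a placeholder):
          | 
          |     import requests
          | 
          |     resp = requests.get(
          |         "https://archive.org/wayback/available",
          |         params={"url": "example.com",
          |                 "timestamp": "20150101"},
          |     )
          |     # "closest" is absent when nothing is archived.
          |     snap = resp.json().get("archived_snapshots", {}).get("closest")
          |     if snap:
          |         print(snap["timestamp"], snap["url"])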
        
       | hartator wrote:
       | I made this!
       | 
        | I had an old website of mine (my old video game portfolio) that
        | I wanted to bring back to life. I had no sources and no backups,
        | but it was still on the Wayback Machine! I first wrote a quick
        | wrapper in Ruby and it worked fine. I then decided to open source
        | it and publish it. It was a fun adventure to see this being used
       | by so many! <3
        
         | simonw wrote:
         | Thanks very much for this! I used it a few years ago to recover
         | some lost content from my blog:
         | https://simonwillison.net/2017/Oct/8/missing-content/
        
         | albatross13 wrote:
         | Kudos for taking the time to write up a nice, clean, and
         | concise readme. I don't know about everyone, but this makes all
         | the difference in the world to me.
        
       ___________________________________________________________________
       (page generated 2021-07-12 23:02 UTC)