[HN Gopher] Ask HN: How to Resurrect a Site from Archive.org?
       ___________________________________________________________________
        
       Ask HN: How to Resurrect a Site from Archive.org?
        
       I recently bought the expired domain of a niche interest site
       because the previous owner was determined to let it die and did not
       want to put any effort in it anymore.  Is there a way I can
       "revive" it from archive.org in a more or less automated fashion?
       Have you ever encountered anything like it? I am familiar with web
       scraping, but archive.org has its peculiarities.  I really, really
       love the content on it.  It's a very niche site, but I would love
       for it to live on.
        
       Author : rrr_oh_man
       Score  : 65 points
       Date   : 2024-11-30 12:57 UTC (6 days ago)
        
       | latexr wrote:
       | Have you tried searching for your question online? I found plenty
       | of results.
       | 
       | https://superuser.com/questions/828907/how-to-download-a-web...
        
         | aspenmayer wrote:
         | Specifically:
         | 
         | https://wiki.archiveteam.org/index.php?title=Restoring
         | 
         | which mentions
         | 
         | https://github.com/hartator/wayback-machine-downloader
         | 
         | and also this tip:
         | 
         | > This is undocumented, but if you retrieve a page with id_
         | after the datecode, you will get the unmodified original
         | document without all the Wayback scripts, header stuff, and
         | link rewriting. This is useful when restoring one page at a
         | time or when writing a tool to retrieve a site:
         | 
         | >
         | http://web.archive.org/web/20051001001126id_/http://www.arch...
         | 
         | From the downloader's issues, you may or may not need to use
         | this forked version if you encounter some errors:
         | 
         | https://github.com/hartator/wayback-machine-downloader/issue...
         | 
         | https://github.com/ShiftaDeband/wayback-machine-downloader
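
A minimal sketch of the `id_` tip quoted above, assuming Python 3. The function name `raw_snapshot_url` and the `example.com` address are illustrative only. The `id_` suffix is the undocumented behavior described on the Archive Team wiki, so it may change without notice.

```python
WAYBACK_PREFIX = "http://web.archive.org/web/"


def raw_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback URL that asks for the capture without link rewriting.

    `timestamp` is the datecode from the snapshot URL (digits only, up to
    14 of them); `id_` is appended right after it, as in the quoted tip.
    """
    if not (timestamp.isdigit() and 1 <= len(timestamp) <= 14):
        raise ValueError(f"not a Wayback datecode: {timestamp!r}")
    return f"{WAYBACK_PREFIX}{timestamp}id_/{original_url}"


if __name__ == "__main__":
    # Hypothetical page; substitute a capture of the site being restored.
    print(raw_snapshot_url("20051001001126", "http://example.com/"))
```

Fetching the printed URL should return the page as originally captured, which is the form you would want to save when restoring one page at a time.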
        
       | bagpuss wrote:
       | Archivarix is the most fully formed easiest way to do this, free
       | https://archivarix.com/
        
         | aspenmayer wrote:
         | This site is misrepresenting itself as open source and free,
         | while simultaneously having an affiliate program and pricing
         | page, which, as I've said, isn't free. It's unverifiable
         | whether or not it's open source, as you don't even download/run
         | the software yourself: it's a web app, which is beside the
         | point, as web apps _could_ also be open source, but as there's
         | no way to self-host this, let alone download it and/or run it,
         | for free or otherwise. I think it's safe to avoid this scammy
         | site.
         | 
         | None of my ire is directed at you, as I don't assume you knew
         | any of this. I just wanted to let you know, in case you were
         | misled as to what the site does by its ad copy.
         | 
         | https://archivarix.com/en/affiliate/
         | 
         | https://archivarix.com/en/#show-prices-wbm
        
           | KomoD wrote:
           | > This site is misrepresenting itself as open source and free
           | 
           | It's not.
           | 
           | They don't say that the site and all the services they offer
           | are free and open-source, they say that the Archivarix CMS is
           | free and open-source (GNU GPLv3), which it is...
           | 
           | > as you don't even download/run the software yourself
           | 
           | You can download the CMS.
           | 
           | > but as there's no way to self-host this, let alone download
           | it and/or run it, for free or otherwise
           | 
           | Again, yes, you can both download it and self-host the CMS.
           | 
           | > I think it's safe to avoid this scammy site.
           | 
           | It's scammy because they're not offering everything for free
           | and open-source even though they never said they would?
           | 
           | https://archivarix.com/en/cms/
        
             | aspenmayer wrote:
             | Unless the CMS lets you backup/restore Internet Archive
             | sites, then that is literally off-topic, beside the point,
             | and doesn't make sense given the context of 'bagpuss's
             | comment. That 'bagpuss was vague about what they said was
             | free doesn't change the context of the discussion.
             | 
             | I stand by what I said, as the Internet Archive feature,
             | which is the entire point of OP's post, is not free on
             | their platform. The CMS is not relevant to this discussion.
             | 
             | It's scammy because the kind of people who would use this
             | wouldn't know how many files are in the backup because they
             | are likely no/low-context users who are likely not familiar
             | with concepts like "average or expected number of files on
             | a website." The pricing is usurious and exploitative due to
             | the pricing model being per file versus by file size for
             | example.
        
               | KomoD wrote:
               | > then that is literally off-topic
               | 
               | It's not, you're claiming they said something they
               | didn't.
               | 
               | > and doesn't make sense given the context of 'bagpuss's
               | comment. That 'bagpuss was vague about what they said was
               | free doesn't change the context of the discussion.
               | 
               | I don't care about bagpuss's comment, they don't
               | represent Archivarix as far as I can tell.
               | 
               | You said the site is misrepresenting itself, it's not.
               | 
               | You said the site is claiming things they haven't.
               | 
               | You called the site "scammy" based on something that they
               | never even claimed.
               | 
               | > It's scammy because the kind of people who would use
               | this wouldn't know how many files are in the backup
               | 
               | Archive.org tells you how many URLs are saved.
               | 
               | Example:
               | https://web.archive.org/details/https://sweetcode.io
               | 
               | 2109+4716+595+732+562+90+28+1+9+1 = 8843 unique URLs.
               | 
               | First file is free. First 1000 files are $0.01 each.
               | Additional thousands are $1 per thousand.
               | 
               | So the price would be $17.84
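
The arithmetic above can be checked with a small sketch, assuming Python 3. It encodes only the tiers as stated in this thread (first file free, next 1000 files at $0.01 each, the rest at $1 per thousand); the function name `recovery_price` is made up for illustration, and the minimum top-up mentioned elsewhere in the thread is not modeled.

```python
from decimal import Decimal, ROUND_HALF_UP


def recovery_price(files: int) -> Decimal:
    """Price in dollars under the tiers quoted in the thread (an assumption)."""
    paid = max(files - 1, 0)          # the first file is free
    first_tier = min(paid, 1000)      # $0.01 each, i.e. 10 tenths of a cent
    rest = paid - first_tier          # $1 per thousand, i.e. 1 tenth of a cent
    mills = first_tier * 10 + rest    # integer tenths of a cent, no float error
    dollars = Decimal(mills) / 1000
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(recovery_price(8843))    # the sweetcode.io example above
    print(recovery_price(25520))   # the "big site" example from the tutorial
```

Working in integer tenths of a cent keeps the per-thousand tier exact before rounding to cents.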
        
               | aspenmayer wrote:
               | I was responding to 'bagpuss because they made a claim
               | that the site linked was responsive to OP, and was free.
               | You and I can quibble about what they meant, but a plain
               | reading of their comment implies that the functionality
               | that OP asked for was free, because 'bagpuss never
               | mentioned the CMS, and in fact the CMS seems like a red
               | herring in this discussion entirely.
               | 
               | I do feel that the site misrepresents the value
               | proposition of the Internet Archive backup/restore
               | service, because the site's value proposition is
               | convenience for users who don't know that there are
               | actually free, actually open source ways to backup and
               | restore content from Internet Archive, and that site
               | isn't it. They're banking on users not knowing any better
               | in that case, which isn't unethical per se, buyer beware
               | etc, but it's shady.
               | 
               | That above combined with the pricing model makes it
               | scammy because you have to spend a minimum of $10 in
               | crypto or other non-reversible payment for something that
               | should not cost the user anything, as the Internet
               | Archive is bearing the lion's share of the costs. And if
               | it doesn't do what you needed, you've already paid in
               | worthless credits.
               | 
               | https://archivarix.com/en/tutorial/#list-3
               | 
               | > Second example: the big site contains 25,520 files.
               | From this quantity you can deduct 1 because they will be
               | free of charge. So we have 25,519 paid files. First
               | thousand will cost $10, and the rest 24,519 costs only $1
               | per thousand, therefore $24.519 . Full price for the big
               | site recovery is $34.52!!!
               | 
               | $34.52 is not a reasonable price for this by any means.
               | 
               | That said, I make no claims about the site being
               | responsive to OP's request, as I'm not OP. I simply
               | rejected the claims brought by 'bagpuss.
        
               | KomoD wrote:
               | > but a plain reading of their comment implies that the
               | functionality that OP asked for was free, because
               | 'bagpuss never mentioned the CMS, and in fact the CMS
               | seems like a red herring in this discussion entirely.
               | 
               | What I'm saying is YOU said THE SITE was misrepresenting
               | itself when THE SITE isn't. It would've been BAGPUSS that
               | was misrepresenting THE SITE if anyone.
               | 
               | > for something that should not cost the user anything,
               | as the Internet Archive is bearing the lion's share of
               | the costs.
               | 
               | It's still costing Archivarix money to run the service,
               | yes you are paying for convenience, I see nothing wrong
               | with that whatsoever.
               | 
               | Ideally the Internet Archive should provide an easy way
               | to download sites but they don't.
               | 
               | > $34.52 is not a reasonable price for this by any means.
               | 
               | Why is it not reasonable? They spent time developing this
               | service and it costs money to run, if you want to save
               | money then yeah you can recover it yourself with some
               | open-source software like wayback-machine-downloader, but
               | some people just want to recover sites without having to
               | bother with any of that.
        
               | aspenmayer wrote:
               | > What I'm saying is YOU said THE SITE was
               | misrepresenting itself when THE SITE isn't. It would be
               | BAGPUSS that was misrepresenting THE SITE.
               | 
               | Both of these things can be true, that 'bagpuss was
               | misrepresenting the site, and the site is intentionally
               | vague as to what is free and what isn't so as to muddy
               | the waters and paint themselves as saviors and good
               | people for being open source while overcharging for a
               | product to the degree that the site misrepresents itself,
               | and I believe that they both are true.
               | 
               | > Ideally the Internet Archive should provide an easy way
               | to download sites but they don't.
               | 
               | I agree, but that's not really relevant to our discussion
               | or to 'bagpuss's claims.
               | 
               | And if IA did provide an easy way to do that, the site
               | linked would be an _even worse deal_.
               | 
               | The site is misrepresenting itself as being worth paying
               | for at any price.
               | 
               | Furthermore, you can download an entire site using your
               | web browser 'Save page as' -> 'web page, complete' dialog
               | in conjunction with the undocumented trick:
               | 
               | > This is undocumented, but if you retrieve a page with
               | id_ after the datecode, you will get the unmodified
               | original document without all the Wayback scripts, header
               | stuff, and link rewriting.
               | 
               | Seems pretty easy to me, but only if you know how. Which
               | is the only reason anyone would use that site - they
               | simply don't know how bad a deal the site is, or they
               | have more dollars than sense.
        
               | KomoD wrote:
               | > and the site is intentionally vague as to what is free
               | 
               | It's not? It says the CMS is free and open-source and
               | they have prices listed for the paid services they
               | provide.
               | 
               | > and paint themselves as saviors and good people for
               | being open source while overcharging for a product to the
               | degree that the site misrepresents itself, and I believe
               | that they both are true.
               | 
               | Simply saying that something is open-source is you
               | painting yourself as a "savior"?
               | 
               | > And if IA did provide an easy way to do that, the site
               | linked would be an even worse deal.
               | 
               | Obviously, if they did provide it then there would be no
               | reason at all to pay.
               | 
               | > Furthermore, you can download an entire site using your
               | web browser 'Save page as' -> 'web page, complete' dialog
               | in conjunction with the undocumented trick:
               | 
               | No, not an entire site, just the current HTML document
               | and the accompanying files for it (e.g. scripts, images,
               | etc.) If you want to sit for hours manually doing that
               | for thousands of pages then feel free.
        
               | aspenmayer wrote:
               | Do you work for the site or something?
               | 
               | > It's not? It says the CMS is free and open-source and
               | they have prices listed for the paid services they
               | provide.
               | 
               | You brought up the CMS. I didn't. I don't have any point
               | to defend regarding it. 'bagpuss was wrong about what
               | they said about the site, and I replied to that.
               | 
               | > Simply saying that something is open-source is you
               | painting yourself as a "savior"?
               | 
               | It's called marketing.
               | 
               | Are you unfamiliar with what scammy means? The site feels
               | scammy to me. So I said so. I don't think you can
               | demonstrate that I _don't_ believe it's scammy, and you
               | haven't convinced me either.
               | 
               | > Obviously, if they did provide it then there would be
               | no reason at all to pay.
               | 
               | I don't have any reason to pay either. 'bagpuss can
               | defend the scammy site, but I won't so I agree there's no
               | reason to pay, for different reasons.
               | 
               | > No, not an entire site, just the current HTML document
               | and the accompanying files for it (e.g. scripts, images,
               | etc.) If you want to sit for hours manually doing that
               | for thousands of pages then feel free.
               | 
               | I have no reason to believe a scammy site will do any
               | better than that either. You haven't demonstrated that
               | the site even works, and their marketing doesn't inspire
               | confidence.
               | 
               | As I didn't introduce the site, I'm not beholden to
               | supporting it or not. Take 'bagpuss to task if anyone.
               | 
               | I don't think you know what you're even arguing about or
               | for because none of your arguments or claims even go
               | anywhere, they all revolve around this scammy site that
               | you didn't even bring up. Nothing about your argument
               | makes sense.
               | 
               | That you haven't made any effort to correct 'bagpuss by
               | replying to them directly is curious.
        
               | KomoD wrote:
               | > Do you work for the site or something?
               | 
               | Nope.
               | 
               | > You brought up the CMS. I didn't.
               | 
               | Yes, again, it was to explain to you that they only say
               | that the CMS is free, not the services because you said:
               | 
               | > This site is misrepresenting itself as open source and
               | free, while simultaneously having an affiliate program
               | and pricing page, which, as I've said, isn't free
               | 
               | They only said their CMS was open-source and free, not
               | any of their other services.
               | 
               | > I don't think you know what you're even arguing about
               | or for because none of your arguments or claims even go
               | anywhere
               | 
               | I was correcting you because you said things that just
               | aren't true:
               | 
               | > This site is misrepresenting itself as open source and
               | free, while simultaneously having an affiliate program
               | and pricing page, which, as I've said, isn't free. It's
               | unverifiable whether or not it's open source, as you
               | don't even download/run the software yourself: it's a web
               | app, which is beside the point, as web apps could also be
               | open source, but as there's no way to self-host this, let
               | alone download it and/or run it, for free or otherwise. I
               | think it's safe to avoid this scammy site.
               | 
               | Which is just not accurate at all, as I've already
               | explained several times. You can dislike the site all you
               | want but you don't need to slander them.
        
               | aspenmayer wrote:
               | >> to clarify, i have nothing to do with this site, i
               | used it once, years back and there was a free tier or at
               | least a free/crippled version at that time
               | 
               | https://news.ycombinator.com/item?id=42291616
               | 
               | Per 'bagpuss, the backup _was_ free when they used it,
               | and they were referring to the backup, _not_ the CMS.
               | 
               | So, I would argue you were mistaken.
        
         | bagpuss wrote:
         | > to clarify, i have nothing to do with this site, i used it
         | once, years back and there was a free tier or at least a
         | free/crippled version at that time
         | 
         | posters, enhance your calm
         | 
         | - bagpuss, fat furry cat puss
        
           | aspenmayer wrote:
           | As it's not free anymore, do you still recommend using it, or
           | do you have a different alternative recommendation in light
           | of it no longer being free?
           | 
           | I appreciate your feedback. Not sure why 'KomoD is defending
           | the site, but at least you understand that it's relevant
           | whether it's free or not.
        
       | duskwuff wrote:
       | > I recently bought the expired domain of a niche interest site
       | because the previous owner was determined to let it die and did
       | not want to put any effort in it anymore. Is there a way I can
       | "revive" it from archive.org in a more or less automated fashion?
       | 
       | Buying a domain name does not award you ownership of the content
       | it previously hosted. If you have not come to some agreement with
       | the previous owner, you should not proceed.
        
         | aspenmayer wrote:
         | Well we can't really assume either way, as OP was vague about
         | how the site was left abandoned. They may have some arrangement
         | that would make this not copyright infringing. In the absence
         | of any affirmative assent in writing reviewed by legal counsel,
         | I'd be inclined to agree with you, and yet I sought to provide
         | the best answer to the question provided, as the legal issues
         | were outside the scope of the question as asked, and the legal
         | issues you raised seem obvious to you and me, and ought also to
         | be so to OP, but we can't make assumptions about the license of
         | the content in question and/or the relevant jurisdiction(s),
         | which may make these points all moot.
        
         | moralestapia wrote:
         | How's that different from the site being hosted at archive.org?
        
           | karel-3d wrote:
           | archive.org approach to copyright is "look, squirrel".
        
             | fastily wrote:
             | "Fair use"
        
         | lhamil64 wrote:
         | What if OP just had the domain redirect to the archive.org
         | page? Then they wouldn't be hosting the content themselves
        
       | donalhunt wrote:
       | Did this 10+ years ago for a circa-2000 band website (was a few
       | html pages). Was fairly straightforward to achieve. Some content
       | (embedded from 3rd party websites) was not recoverable.
        
       | Alifatisk wrote:
       | HTTrack? You should not do it without the owner's consent though.
        
         | aspenmayer wrote:
         | Seems legit.
         | 
         | > HTTrack is a free (GPL, libre/free software) and easy-to-use
         | offline browser utility.
         | 
         | Available on Windows, Mac, Linux, and Android.
        
       | paxys wrote:
       | Have you spoken to the previous owner about any of this?
       | Otherwise it's pretty crazy to just take ownership of the site
       | and all its content without a written agreement in place. You are
       | opening yourself up to a massive amount of liability for no
       | reason.
        
         | aspenmayer wrote:
         | I agree with your points, but as the original host of the site
         | no longer is continuing to host it, I doubt they would be any
         | more interested in what others do regarding it, but a lawsuit
         | with a potential payday might motivate them. I broadly agree
         | with you though.
        
       | moxvallix wrote:
       | You can use wayback_machine_downloader to automate downloading
       | the archived pages:
       | https://github.com/hartator/wayback-machine-downloader/
        
         | d3VwsX wrote:
         | That used to work great for me, but recently it started to
         | fail. It downloads a few pages but then it gets errors, as if
         | it is detected and prevented by the server from scraping.
        
           | toomuchtodo wrote:
           | > as if it is detected and prevented by the server from
           | scraping.
           | 
           | Yes.
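
If the errors described above are rate limiting, one common mitigation is to slow down and retry with growing pauses. The sketch below assumes Python 3 and assumes that throttling is the cause. Nothing in this thread states what limits archive.org applies, and `fetch_with_backoff`, its defaults, and the injected `fetch` callable are hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], bytes],
    url: str,
    retries: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), pausing 2s, 4s, 8s, ... between failed attempts."""
    last_error: Optional[OSError] = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            last_error = err
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url}") from last_error
```

Passing the downloader in as `fetch` (for example, a small wrapper around `urllib.request.urlopen`) keeps the retry policy separate from the HTTP client, so the delays can be tuned or tested without touching the network.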
        
       | toast0 wrote:
       | I did this for a niche site, but it was only 20 pages.
       | 
       | I pulled each page off internet archive, saved it as an archive;
       | then did some minor tidying up, setting viewports for mobile,
       | updating the linkback html snippet to go to my url instead of the
       | old dead one, changing the snippet to not suggest hotloading the
       | link image, crop the dead url out of the link image, pngcrush the
       | images, put it on cheap hosting for static pages.
       | 
       | I did a bit of poking around trying to find a way to contact the
       | owner, but had no luck. If they come back and want it down, I'll
       | take it down. Copyright notices are intact. I'm clearly violating
       | the author's copyrights, and I accept that.
        
         | gopher_space wrote:
         | > I'm clearly violating the author's copyrights, and I accept
         | that.
         | 
         | I'm looking at combining several old message boards into
         | something useful, and I'd like to be proactive regarding
         | copyright. My approach so far:
         | 
         | - I'm assuming that everyone owns their own post/comment.
         | 
         | - I'm assuming that submitting content meant they intended to
         | grant rights to community members.
         | 
         | - I'm assuming that work done in support of the original
         | community would be welcomed by members.
         | 
         | - And I'm assuming this all changes if I want money.
         | 
         | So I'm preserving attributions when I can, but treat content
         | like it's CC or similar as long as I'm operating within the
         | original authors' area of concern. Anything that actually gets
         | released will be as open as possible... and probably start with
         | telling you how to download files. Entirely walling off my code
         | makes sense but then it is no longer a fun little project, it
         | is a framework.
        
       | janesvilleseo wrote:
       | This is something that used to be done quite a bit in the SEO world.
       | Not sure if it still holds any SEO value. Probably some, but maybe
       | not at the same level.
       | 
       | Anyways, there are tools out there. I haven't used them, but a tool
       | like
       | https://www6.waybackmachinedownloader.com/website-downloader...
       | 
       | Or
       | 
       | https://websitedownloader.com/
       | 
       | Should do the trick. Depending on the size of the site a small
       | cost is involved.
       | 
       | They can even package them into usable files.
        
       | canU4 wrote:
       | Isn't a simple wget -r enough?
        
       | comboy wrote:
       | wget --mirror --convert-links --page-requisites --no-parent URL
       | 
       | But yeah it's also not clear to me regarding copyrights and such.
        
       | ulrischa wrote:
       | I've seen a lot of people do this when resurrecting old niche
       | sites. The high-level approach usually involves grabbing all the
       | snapshots from archive.org, stripping out their timestamped URLs,
       | and consolidating everything into a local mirror. In practice,
       | you want to:
       | 
       | 1. Collect a list of archived URLs (via archive.org's CDX
       |    endpoints).
       | 2. Download each page and all related assets.
       | 3. Rewrite all links that currently point to `web.archive.org` so
       |    they point to your domain or your local file paths.
       | 
       | The tricky part is the Wayback Machine's directory structure--
       | every file is wrapped in these time-stamped URLs. You'll need to
       | remove those prefixes, leaving just the original directory
       | layout. There's no perfect, purely automated solution, because
       | sometimes assets are missing or broken. Be prepared for some
       | manual cleanup.
       | 
       | Beyond that, the process is basically: gather everything, clean
       | up links, restore the original hierarchy, and then host it on
       | your server. Tools exist that partially automate this (for
       | example, some people have written scripts to do the CDX fetching
       | and rewriting), but if you're comfortable with web scraping
       | logic, you can handle it with a few careful passes. In the end,
       | you'll have a mostly faithful static snapshot of the old site
       | running under your revived domain.
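
The listing and link-rewriting steps described above could be sketched roughly as follows, assuming Python 3.9+. The helper names are hypothetical, and the query parameters (`output=json`, `fl`, `filter`, `collapse`) reflect my understanding of the Wayback CDX server API, not anything stated in this thread. Downloading is left out, and missing or broken assets still need manual cleanup.

```python
import json
import re
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str) -> str:
    """URL listing one successful capture per distinct page under `domain`."""
    params = {
        "url": f"{domain}/*",          # everything below the domain
        "output": "json",
        "fl": "timestamp,original",    # only the fields used below
        "filter": "statuscode:200",
        "collapse": "urlkey",          # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(body: str) -> list[tuple[str, str]]:
    """Turn the JSON response into (timestamp, original_url) pairs."""
    rows = json.loads(body) if body.strip() else []
    return [(ts, original) for ts, original in rows[1:]]  # row 0 is the header


# Matches the time-stamped wrapper, with or without the host and with an
# optional two-letter modifier such as im_ or js_, when followed by a URL.
WAYBACK_LINK = re.compile(
    r"(?:(?:https?:)?//web\.archive\.org)?/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def strip_wayback_prefix(html: str) -> str:
    """Remove Wayback wrappers so links point at the original URLs again."""
    return WAYBACK_LINK.sub("", html)
```

Stripping the wrapper leaves the original absolute URLs, which resolve to the revived domain once it is live. Converting them to local file paths would be a further pass.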
        
       | Sysreq2 wrote:
       | You could also consider using the Common Crawl dataset provided
       | by Amazon. Archive.org is more or less a wrapper around it
       | anyways.
       | 
       | https://registry.opendata.aws/commoncrawl/
        
       | aoipoa wrote:
       | This was posted 6 days ago but it's reappeared now 4 hours ago.
       | What happened?
       | 
       | https://hn.algolia.com/?q=ask+hn+resurrect+site+archive
       | 
       | Very odd.
       | 
       | Even the times of the comments have changed, this is what the
       | post looked like yesterday:
       | 
       | https://web.archive.org/web/20241205054108/https://news.ycom...
        
         | denotational wrote:
         | HN has a "resubmit" mechanism whereby the mods can resubmit
         | interesting posts if they think they might stimulate more
         | interest by being posted at a different time (or just by having
         | better luck).
         | 
         | To avoid a dupe, this mechanism post-dates the original post.
        
       | alsetmusic wrote:
       | I've been thinking about buying a sibling domain (.net instead of
       | .com) to re-host a fantastic essay that disappeared from the web
       | some years back. I would make it clear that I didn't write it and
       | offer to remove it if the original author contacted me requesting
       | that I remove it (it did not include attribution in its original
       | form). But the issue has been enough of a grey area that I
       | haven't pulled the trigger.
       | 
       | For anyone who may be curious, wayback machine has an archive:
       | fuckthesouth.com
        
       | joshdavham wrote:
       | Can I ask what site it was? Reading this made me think of a very
       | specific site that I'd also like to see revived and I'm wondering
       | if we're thinking of the same site.
        
       ___________________________________________________________________
       (page generated 2024-12-06 23:02 UTC)