[HN Gopher] Ask HN: How to Resurrect a Site from Archive.org?
___________________________________________________________________
Ask HN: How to Resurrect a Site from Archive.org?
I recently bought the expired domain of a niche interest site
because the previous owner was determined to let it die and did not
want to put any effort in it anymore. Is there a way I can
"revive" it from archive.org in a more or less automated fashion?
Have you ever encountered anything like it? I am familiar with web
scraping, but archive.org has its peculiarities. I really, really
love the content on it. It's a very niche site, but I would love
for it to live on.
Author : rrr_oh_man
Score : 65 points
Date : 2024-11-30 12:57 UTC (6 days ago)
| latexr wrote:
| Have you tried searching for your question online? I found plenty
| of results.
|
| https://superuser.com/questions/828907/how-to-download-a-web...
| aspenmayer wrote:
| Specifically:
|
| https://wiki.archiveteam.org/index.php?title=Restoring
|
| which mentions
|
| https://github.com/hartator/wayback-machine-downloader
|
| and also this tip:
|
| > This is undocumented, but if you retrieve a page with id_
| after the datecode, you will get the unmodified original
| document without all the Wayback scripts, header stuff, and
| link rewriting. This is useful when restoring one page at a
| time or when writing a tool to retrieve a site:
|
| >
| http://web.archive.org/web/20051001001126id_/http://www.arch...
|
| From the downloader's issues, you may or may not need to use
| this forked version if you encounter some errors:
|
| https://github.com/hartator/wayback-machine-downloader/issue...
|
| https://github.com/ShiftaDeband/wayback-machine-downloader
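The id_ tip quoted above can be sketched as a small helper. This is a minimal illustration, not part of any of the linked tools: the function name raw_snapshot_url and the example.com address are hypothetical, and the only behaviour assumed is the one the quote describes (appending id_ to the datecode returns the unmodified document).

```python
import re

# Base of every Wayback Machine snapshot URL.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def raw_snapshot_url(timestamp: str, original_url: str) -> str:
    """Build a snapshot URL with the id_ modifier after the datecode.

    Per the tip quoted in the thread, this form returns the original
    document without Wayback scripts, headers, or link rewriting.
    """
    # Wayback datecodes are digits only (up to YYYYMMDDhhmmss).
    if not re.fullmatch(r"\d{4,14}", timestamp):
        raise ValueError("timestamp must be 4 to 14 digits")
    return f"{WAYBACK_PREFIX}{timestamp}id_/{original_url}"


# Hypothetical example address, not taken from the thread.
print(raw_snapshot_url("20051001001126", "http://example.com/"))
```

A scraper would call this once per (timestamp, URL) pair before fetching, so the saved files need no cleanup of injected Wayback markup.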
| bagpuss wrote:
| Archivarix is the most fully formed easiest way to do this, free
| https://archivarix.com/
| aspenmayer wrote:
| This site is misrepresenting itself as open source and free,
| while simultaneously having an affiliate program and pricing
| page, which, as I've said, isn't free. It's unverifiable
| whether or not it's open source, as you don't even download/run
| the software yourself: it's a web app, which is beside the
| point, as web apps _could_ also be open source, but as there 's
| no way to self-host this, let alone download it and/or run it,
| for free or otherwise. I think it's safe to avoid this scammy
| site.
|
| None of my ire is directed at you, as I don't assume you knew
| any of this. I just wanted to let you know, in case you were
| misled as to what the site does by its ad copy.
|
| https://archivarix.com/en/affiliate/
|
| https://archivarix.com/en/#show-prices-wbm
| KomoD wrote:
| > This site is misrepresenting itself as open source and free
|
| It's not.
|
| They don't say that the site and all the services they offer
| are free and open-source, they say that the Archivarix CMS is
| free and open-source (GNU GPLv3), which it is...
|
| > as you don't even download/run the software yourself
|
| You can download the CMS.
|
| > but as there's no way to self-host this, let alone download
| it and/or run it, for free or otherwise
|
| Again yes you can both download it and self-host the CMS
|
| > I think it's safe to avoid this scammy site.
|
| It's scammy because they're not offering everything for free
| and open-source even though they never said they would?
|
| https://archivarix.com/en/cms/
| aspenmayer wrote:
| Unless the CMS lets you backup/restore Internet Archive
| sites, then that is literally off-topic, beside the point,
| and doesn't make sense given the context of 'bagpuss's
| comment. That 'bagpuss was vague about what they said was
| free doesn't change the context of the discussion.
|
| I stand by what I said, as the Internet Archive feature,
| which is the entire point of OP's post, is not free on
| their platform. The CMS is not relevant to this discussion.
|
| It's scammy because the kind of people who would use this
| wouldn't know how many files are in the backup because they
| are likely no/low-context users who are likely not familiar
| with concepts like "average or expected number of files on
| a website." The pricing is usurious and exploitative due to
| the pricing model being per file versus by file size for
| example.
| KomoD wrote:
| > then that is literally off-topic
|
| It's not, you're claiming they said something they
| didn't.
|
| > and doesn't make sense given the context of 'bagpuss's
| comment. That 'bagpuss was vague about what they said was
| free doesn't change the context of the discussion.
|
| I don't care about bagpuss's comment, they don't
| represent Archivarix as far as I can tell.
|
| You said the site is misrepresenting itself, it's not.
|
| You said the site is claiming things they haven't.
|
| You called the site "scammy" based on something that they
| never even claimed.
|
| > It's scammy because the kind of people who would use
| this wouldn't know how many files are in the backup
|
| Archive.org tells you how many URLs are saved.
|
| Example:
| https://web.archive.org/details/https://sweetcode.io
|
| 2109+4716+595+732+562+90+28+1+9+1 = 8843 unique URLs.
|
| First file is free. First 1000 files are $0.01 each.
| Additional thousands are $1 per thousand.
|
| So the price would be $17.84
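The arithmetic above can be checked with a short sketch. It assumes only the tiers as stated in this comment (first file free, the next 1,000 files at $0.01 each, everything beyond that at $1 per thousand); the function name recovery_price is hypothetical and this is not Archivarix's own code.

```python
from decimal import Decimal


def recovery_price(file_count: int) -> Decimal:
    """Price in dollars under the tiers described in the thread."""
    if file_count < 0:
        raise ValueError("file_count must not be negative")
    paid = max(file_count - 1, 0)   # the first file is free
    first_tier = min(paid, 1000)    # next 1,000 files at $0.01 each
    rest = paid - first_tier        # remainder at $1 per thousand
    return first_tier * Decimal("0.01") + rest * Decimal("0.001")


# 8,843 URLs: 10.00 + 7.842 = 17.842, i.e. $17.84 rounded to cents.
print(recovery_price(8843).quantize(Decimal("0.01")))
```

The same function gives $34.52 for the 25,520-file example from the site's tutorial that is quoted later in the thread.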
| aspenmayer wrote:
| I was responding to 'bagpuss because they made a claim
| that the site linked was responsive to OP, and was free.
| You and I can quibble about what they meant, but a plain
| reading of their comment implies that the functionality
| that OP asked for was free, because 'bagpuss never
| mentioned the CMS, and in fact the CMS seems like a red
| herring in this discussion entirely.
|
| I do feel that the site misrepresents the value
| proposition of the Internet Archive backup/restore
| service, because the site's value proposition is
| convenience for users who don't know that there are
| actually free, actually open source ways to backup and
| restore content from Internet Archive, and that site
| isn't it. They're banking on users not knowing any better
| in that case, which isn't unethical per se, buyer beware
| etc, but it's shady.
|
| That above combined with the pricing model makes it
| scammy because you have to spend a minimum of $10 in
| crypto or other non-reversible payment for something that
| should not cost the user anything, as the Internet
| Archive is bearing the lion's share of the costs. And if
| it doesn't do what you needed, you've already paid in
| worthless credits.
|
| https://archivarix.com/en/tutorial/#list-3
|
| > Second example: the big site contains 25,520 files.
| From this quantity you can deduct 1 because they will be
| free of charge. So we have 25,519 paid files. First
| thousand will cost $10, and the rest 24,519 costs only $1
| per thousand, therefore $24.519 . Full price for the big
| site recovery is $34.52!!!
|
| $34.52 is not a reasonable price for this by any means.
|
| That said, I make no claims about the site being
| responsive to OP's request, as I'm not OP. I simply
| rejected the claims brought by 'bagpuss.
| KomoD wrote:
| > but a plain reading of their comment implies that the
| functionality that OP asked for was free, because
| 'bagpuss never mentioned the CMS, and in fact the CMS
| seems like a red herring in this discussion entirely.
|
| What I'm saying is YOU said THE SITE was misrepresenting
| itself when THE SITE isn't. It would've been BAGPUSS that
| was misrepresenting THE SITE if anyone.
|
| > for something that should not cost the user anything,
| as the Internet Archive is bearing the lion's share of
| the costs.
|
| It's still costing Archivarix money to run the service,
| yes you are paying for convenience, I see nothing wrong
| with that whatsoever.
|
| Ideally the Internet Archive should provide an easy way
| to download sites but they don't.
|
| > $34.52 is not a reasonable price for this by any means.
|
| Why is it not reasonable? They spent time developing this
| service and it costs money to run, if you want to save
| money then yeah you can recover it yourself with some
| open-source software like wayback-machine-downloader, but
| some people just want to recover sites without having to
| bother with any of that.
| aspenmayer wrote:
| > What I'm saying is YOU said THE SITE was
| misrepresenting itself when THE SITE isn't. It would be
| BAGPUSS that was misrepresenting THE SITE.
|
| Both of these things can be true, that 'bagpuss was
| misrepresenting the site, and the site is intentionally
| vague as to what is free and what isn't so as to muddy
| the waters and paint themselves as saviors and good
| people for being open source while overcharging for a
| product to the degree that the site misrepresents itself,
| and I believe that they both are true.
|
| > Ideally the Internet Archive should provide an easy way
| to download sites but they don't.
|
| I agree, but that's not really relevant to our discussion
| or to 'bagpuss's claims.
|
| And if IA did provide an easy way to do that, the site
| linked would be an _even worse deal_.
|
| The site is misrepresenting itself as being worth paying
| for at any price.
|
| Furthermore, you can download an entire site using your
| web browser 'Save page as' -> 'web page, complete' dialog
| in conjunction with the undocumented trick:
|
| > This is undocumented, but if you retrieve a page with
| id_ after the datecode, you will get the unmodified
| original document without all the Wayback scripts, header
| stuff, and link rewriting.
|
| Seems pretty easy to me, but only if you know how. Which
| is the only reason anyone would use that site - they
| simply don't know how bad a deal the site is, or they
| have more dollars than sense.
| KomoD wrote:
| > and the site is intentionally vague as to what is free
|
| It's not? It says the CMS is free and open-source and
| they have prices listed for the paid services they
| provide.
|
| > and paint themselves as saviors and good people for
| being open source while overcharging for a product to the
| degree that the site misrepresents itself, and I believe
| that they both are true.
|
| Simply saying that something is open-source is you
| painting yourself as a "savior"?
|
| > And if IA did provide an easy way to do that, the site
| linked would be an even worse deal.
|
| Obviously, if they did provide it then there would be no
| reason at all to pay.
|
| > Furthermore, you can download an entire site using your
| web browser 'Save page as' -> 'web page, complete' dialog
| in conjunction with the undocumented trick:
|
| No, not an entire site, just the current HTML document
| and the accompanying files for it (e.g. scripts, images,
| etc.) If you want to sit for hours manually doing that
| for thousands of pages then feel free.
| aspenmayer wrote:
| Do you work for the site or something?
|
| > It's not? It says the CMS is free and open-source and
| they have prices listed for the paid services they
| provide.
|
| You brought up the CMS. I didn't. I don't have any point
| to defend regarding it. 'bagpuss was wrong about what
| they said about the site, and I replied to that.
|
| > Simply saying that something is open-source is you
| painting yourself as a "savior"?
|
| It's called marketing.
|
| Are you unfamiliar with what scammy means? The site feels
| scammy to me. So I said so. I don't think you can
| demonstrate that I _don't_ believe it's scammy, and you
| haven't convinced me either.
|
| > Obviously, if they did provide it then there would be
| no reason at all to pay.
|
| I don't have any reason to pay either. 'bagpuss can
| defend the scammy site, but I won't so I agree there's no
| reason to pay, for different reasons.
|
| > No, not an entire site, just the current HTML document
| and the accompanying files for it (e.g. scripts, images,
| etc.) If you want to sit for hours manually doing that
| for thousands of pages then feel free.
|
| I have no reason to believe a scammy site will do any
| better than that either. You haven't demonstrated that
| the site even works, and their marketing doesn't inspire
| confidence.
|
| As I didn't introduce the site, I'm not beholden to
| supporting it or not. Take 'bagpuss to task if anyone.
|
| I don't think you know what you're even arguing about or
| for because none of your arguments or claims even go
| anywhere, they all revolve around this scammy site that
| you didn't even bring up. Nothing about your argument
| makes sense.
|
| That you haven't made any effort to correct 'bagpuss by
| replying to them directly is curious.
| KomoD wrote:
| > Do you work for the site or something?
|
| Nope.
|
| > You brought up the CMS. I didn't.
|
| Yes, again, it was to explain to you that they only say
| that the CMS is free, not the services because you said:
|
| > This site is misrepresenting itself as open source and
| free, while simultaneously having an affiliate program
| and pricing page, which, as I've said, isn't free
|
| They only said their CMS was open-source and free, not
| any of their other services.
|
| > I don't think you know what you're even arguing about
| or for because none of your arguments or claims even go
| anywhere
|
| I was correcting you because you said things that just
| aren't true:
|
| > This site is misrepresenting itself as open source and
| free, while simultaneously having an affiliate program
| and pricing page, which, as I've said, isn't free. It's
| unverifiable whether or not it's open source, as you
| don't even download/run the software yourself: it's a web
| app, which is beside the point, as web apps could also be
| open source, but as there's no way to self-host this, let
| alone download it and/or run it, for free or otherwise. I
| think it's safe to avoid this scammy site.
|
| Which is just not accurate at all, as I've already
| explained several times. You can dislike the site all you
| want but you don't need to slander them.
| aspenmayer wrote:
| >> to clarify, i have nothing to do with this site, i
| used it once, years back and there was a free tier or at
| least a free/crippled version at that time
|
| https://news.ycombinator.com/item?id=42291616
|
| Per 'bagpuss, the backup _was_ free when they used it,
| and they were referring to the backup, _not_ the CMS.
|
| So, I would argue you were mistaken.
| bagpuss wrote:
| > to clarify, i have nothing to do with this site, i used it
| once, years back and there was a free tier or at least a
| free/crippled version at that time
|
| posters, enhance your calm
|
| - bagpuss, fat furry cat puss
| aspenmayer wrote:
| As it's not free anymore, do you still recommend using it, or
| do you have a different alternative recommendation in light
| of it no longer being free?
|
| I appreciate your feedback. Not sure why 'KomoD is defending
| the site, but at least you understand that it's relevant
| whether it's free or not.
| duskwuff wrote:
| > I recently bought the expired domain of a niche interest site
| because the previous owner was determined to let it die and did
| not want to put any effort in it anymore. Is there a way I can
| "revive" it from archive.org in a more or less automated fashion?
|
| Buying a domain name does not award you ownership of the content
| it previously hosted. If you have not come to some agreement with
| the previous owner, you should not proceed.
| aspenmayer wrote:
| Well we can't really assume either way, as OP was vague about
| how the site was left abandoned. They may have some arrangement
| that would make this not copyright infringing. In the absence
| of any affirmative assent in writing reviewed by legal counsel,
| I'd be inclined to agree with you, and yet I sought to provide
| the best answer to the question provided, as the legal issues
| were outside the scope of the question as asked, and the legal
| issues you raised seem obvious to you and me, and ought also to
| be so to OP, but we can't make assumptions about the license of
| the content in question and/or the relevant jurisdiction(s),
| which may make these points all moot.
| moralestapia wrote:
| How's that different from the site being hosted at archive.org?
| karel-3d wrote:
| archive.org approach to copyright is "look, squirrel".
| fastily wrote:
| "Fair use"
| lhamil64 wrote:
| What if OP just had the domain redirect to the archive.org
| page? Then they wouldn't be hosting the content themselves
| donalhunt wrote:
| Did this 10+ years ago for a circa-2000 band website (was a few
| html pages). Was fairly straightforward to achieve. Some content
| (embedded from 3rd party websites) was not recoverable.
| Alifatisk wrote:
| HTTrack? You should not do it without the owner's consent though.
| aspenmayer wrote:
| Seems legit.
|
| > HTTrack is a free (GPL, libre/free software) and easy-to-use
| offline browser utility.
|
| Available on Windows, Mac, Linux, and Android.
| paxys wrote:
| Have you spoken to the previous owner about any of this?
| Otherwise it's pretty crazy to just take ownership of the site
| and all its content without a written agreement in place. You are
| opening yourself up to a massive amount of liability for no
| reason.
| aspenmayer wrote:
| I agree with your points, but as the original host of the site
| no longer is continuing to host it, I doubt they would be any
| more interested in what others do regarding it, but a lawsuit
| with a potential payday might motivate them. I broadly agree
| with you though.
| moxvallix wrote:
| You can use wayback_machine_downloader to automate downloading
| the archived pages:
| https://github.com/hartator/wayback-machine-downloader/
| d3VwsX wrote:
| That used to work great for me, but recently it started to
| fail. It downloads a few pages but then it gets errors, as if
| it is detected and prevented by the server from scraping.
| toomuchtodo wrote:
| > as if it is detected and prevented by the server from
| scraping.
|
| Yes.
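If the failures described above are the server throttling a scraper, one common mitigation is to retry with growing pauses. The sketch below assumes exactly that and nothing more: the thread does not say what limits archive.org applies, and the function name, attempt count, and delays are all hypothetical choices.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on network errors with doubling delays."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    raise ValueError("attempts must be at least 1")
```

The fetch argument would typically be a lambda wrapping urllib.request.urlopen for one snapshot URL; passing the sleep function in makes the helper easy to test without waiting.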
| toast0 wrote:
| I did this for a niche site, but it was only 20 pages.
|
| I pulled each page off internet archive, saved it as an archive;
| then did some minor tidying up, setting viewports for mobile,
| updating the linkback html snippet to go to my url instead of the
| old dead one, changing the snippet to not suggest hotloading the
| link image, crop the dead url out of the link image, pngcrush the
| images, put it on cheap hosting for static pages.
|
| I did a bit of poking around trying to find a way to contact the
| owner, but had no luck. If they come back and want it down, I'll
| take it down. Copyright notices are intact. I'm clearly violating
| the author's copyrights, and I accept that.
| gopher_space wrote:
| > I'm clearly violating the author's copyrights, and I accept
| that.
|
| I'm looking at combining several old message boards into
| something useful, and I'd like to be proactive regarding
| copyright. My approach so far:
|
| - I'm assuming that everyone owns their own post/comment.
|
| - I'm assuming that submitting content meant they intended to
| grant rights to community members.
|
| - I'm assuming that work done in support of the original
| community would be welcomed by members.
|
| - And I'm assuming this all changes if I want money.
|
| So I'm preserving attributions when I can, but treat content
| like it's CC or similar as long as I'm operating within the
| original authors' area of concern. Anything that actually gets
| released will be as open as possible... and probably start with
| telling you how to download files. Entirely walling off my code
| makes sense but then it is no longer a fun little project, it
| is a framework.
| janesvilleseo wrote:
| This is something that used to be done quite a bit in the SEO
| world. Not sure if it still holds any SEO value. Probably some,
| but maybe not at the same level.
|
| Anyways, there are tools out there. I haven't used them.
|
| But a tool like
| https://www6.waybackmachinedownloader.com/website-downloader...
|
| Or
|
| https://websitedownloader.com/
|
| Should do the trick. Depending on the size of the site a small
| cost is involved.
|
| They can even package them into usable files.
| canU4 wrote:
| Isn't a simple wget -r enough?
| comboy wrote:
| wget --mirror --convert-links --page-requisites --no-parent URL
|
| But yeah it's also not clear to me regarding copyrights and such.
| ulrischa wrote:
| I've seen a lot of people do this when resurrecting old niche
| sites. The high-level approach usually involves grabbing all the
| snapshots from archive.org, stripping out their timestamped URLs,
| and consolidating everything into a local mirror. In practice,
| you want to:
|
| 1. Collect a list of archived URLs (via archive.org's CDX
| endpoints).
| 2. Download each page and all related assets.
| 3. Rewrite all links that currently point to `web.archive.org` so
| they point to your domain or your local file paths.
|
| The tricky part is the Wayback Machine's directory structure--
| every file is wrapped in these time-stamped URLs. You'll need to
| remove those prefixes, leaving just the original directory
| layout. There's no perfect, purely automated solution, because
| sometimes assets are missing or broken. Be prepared for some
| manual cleanup.
|
| Beyond that, the process is basically: gather everything, clean
| up links, restore the original hierarchy, and then host it on
| your server. Tools exist that partially automate this (for
| example, some people have written scripts to do the CDX fetching
| and rewriting), but if you're comfortable with web scraping
| logic, you can handle it with a few careful passes. In the end,
| you'll have a mostly faithful static snapshot of the old site
| running under your revived domain.
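Steps 1 and 3 above can be sketched in a few lines. This is an illustration under stated assumptions, not the commenter's script: the function names and example.com are hypothetical, the query parameters are the Wayback CDX server's url, matchType, output, fl, filter and collapse options, and the regex only handles links of the form /web/<datecode>[modifier]/<original URL>.

```python
import re
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str) -> str:
    """Step 1: URL listing one successful capture per archived URL."""
    params = {
        "url": domain,
        "matchType": "prefix",       # everything under this host/path
        "output": "json",
        "fl": "timestamp,original",  # only the fields needed to download
        "filter": "statuscode:200",
        "collapse": "urlkey",        # one row per distinct URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# Matches absolute or site-relative Wayback links, with an optional
# two-letter modifier such as id_ or im_, but only when an original
# http(s) URL follows, so ordinary /web/ paths are left alone.
WAYBACK_LINK = re.compile(
    r"(?:(?:https?:)?//web\.archive\.org)?/web/\d{1,14}(?:[a-z]{2}_)?/(?=https?:)"
)


def strip_wayback_prefix(html: str) -> str:
    """Step 3: drop the time-stamped prefix, leaving original URLs."""
    return WAYBACK_LINK.sub("", html)


page = '<a href="https://web.archive.org/web/20051001001126/http://example.com/a.html">a</a>'
print(strip_wayback_prefix(page))
```

After stripping, links are absolute URLs on the original domain, which resolve correctly once the mirror is served from that domain; turning them into relative paths for local browsing would need a further pass.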
| Sysreq2 wrote:
| You could also consider using the Common Crawl dataset provided
| by Amazon. Archive.org is more or less a wrapper around it
| anyways.
|
| https://registry.opendata.aws/commoncrawl/
| aoipoa wrote:
| This was posted 6 days ago but it's reappeared now 4 hours ago.
| What happened?
|
| https://hn.algolia.com/?q=ask+hn+resurrect+site+archive
|
| Very odd.
|
| Even the times of the comments have changed, this is what the
| post looked like yesterday:
|
| https://web.archive.org/web/20241205054108/https://news.ycom...
| denotational wrote:
| HN has a "resubmit" mechanism whereby the mods can resubmit
| interesting posts if they think they might stimulate more
| interest by being posted at a different time (or just by having
| better luck).
|
| To avoid a dupe, this mechanism post-dates the original post.
| alsetmusic wrote:
| I've been thinking about buying a sibling domain (.net instead of
| .com) to re-host a fantastic essay that disappeared from the web
| some years back. I would make it clear that I didn't write it and
| offer to remove it if the original author contacted me requesting
| that I remove it (it did not include attribution in its original
| form). But the issue has been enough of a grey area that I
| haven't pulled the trigger.
|
| For anyone who may be curious, wayback machine has an archive:
| fuckthesouth.com
| joshdavham wrote:
| Can I ask what site it was? Reading this made me think of a very
| specific site that I'd also like to see revived and I'm wondering
| if we're thinking of the same site.
___________________________________________________________________
(page generated 2024-12-06 23:02 UTC)