[HN Gopher] Ask HN: How does archive.is bypass paywalls?
___________________________________________________________________
Ask HN: How does archive.is bypass paywalls?
If it simply visits sites, it will face a paywall too. If it
identifies itself as archive.is, then other people could identify
themselves the same way.
Author : flerovium
Score : 67 points
Date : 2023-05-24 16:56 UTC (6 hours ago)
| xiekomb wrote:
| I thought they used this browser extension:
| https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean
| flerovium wrote:
| That extension does work, but do we know they use it?
| marcod wrote:
| They don't always use it, because I can archive a new page
| from my mobile phone browser, which doesn't even support
| extensions.
|
| My guess is that most content providers with paywalls serve
| the entire content, so search engines can pick it up, and
| then use scripts to raise the paywall - archive.is takes
| their snapshot before that happens / doesn't trigger those
| scripts.
| DrDentz wrote:
| It's actually the opposite, for some news sites this extension
| links to archive.is because that's the only known way to bypass
| the paywall.
| retrocryptid wrote:
| Many (most?) "big content" sites let Google and Bing spiders
| scrape the contents of articles so when people search for terms
| in the article they'll find a hit and then get referred to the
| pay wall.
|
| Google doesn't want everyone to know what a Google indexing
| request looks like for fear the CEO mafia will institute
| shenanigans. And the content providers (NYT, WaPo, etc.) don't
| want people to know 'cause they don't want people evading their
| paywall.
|
| Or maybe they're okay with letting the archive index their
| content...
| Atlas22 wrote:
| Just FYI, Google and Bing publish their user agent strings[1][2]
| for the crawlers. At least in my experience, most of the typical
| ad-infested and paywalled news sites won't display the paywall
| if you change the user agent to a crawler they prefer.
|
| [1] https://developers.google.com/search/docs/crawling-indexing/...
| [2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...
| wolverine876 wrote:
| Doesn't almost every site on the web know exactly what the
| Google bot looks like?
| peter422 wrote:
| Google gives precise details about how to verify their bot is
| crawling your site and how to denote what content is paywalled
| and what isn't.
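A minimal Python sketch of the reverse-then-forward DNS check that Google describes for verifying its crawler. The function names, the injectable resolver parameters, and the exact suffix list are assumptions made for illustration; the suffixes should be checked against Google's current documentation.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address (needs network access)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to (needs network access)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Reverse-then-forward DNS check; the User-Agent header is never trusted."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must map back to the original address,
        # otherwise anyone controlling their own PTR record could pass.
        return ip in forward(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. A site that only compares the User-Agent string skips this check, which is why spoofing the header works on some paywalls.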
| gregjor wrote:
| [flagged]
| not_your_vase wrote:
| They use you, as a proxy. If you (who archives it) have access to
| the site (either because you paid or have free articles), they
| can archive it too. If you don't have access, they only archive a
| paywall.
| RicoElectrico wrote:
| Nice try, media company employee ;)
|
| /jk
| PTOB wrote:
| My sentiments exactly.
| jakedata wrote:
| Alas it doesn't allow access to the comment section of the WSJ
| which is the only reason I would visit the site. WSJ comments
| reinforce my opinion of the majority of humanity. My father allowed
| his subscription to lapse and I won't send them my money so I
| will just have to imagine it.
| fxtentacle wrote:
| Probably the people who operate archive.is just purchased
| subscriptions for the most common newspaper sites. And then they
| can use something like https://pptr.dev/ to automate login and
| article retrieval.
|
| I guess the business model is to inject their ads into someone
| else's content, so kinda like Facebook. That would also surely
| generate more money from the ads than the cost of subscribing to
| multiple newspapers.
| panopticon wrote:
| > _Probably the people who operate archive.is just purchased
| subscriptions for the most common newspaper sites. And then
| they can use something like https://pptr.dev/ to automate login
| and article retrieval._
|
| I would expect to see login information rather than "Sign In"
| and "Subscribe" buttons on archived articles then. Unless
| they're stripping that from the archive?
| tivert wrote:
| > Probably the people who operate archive.is just purchased
| subscriptions for the most common newspaper sites. And then
| they can use something like https://pptr.dev/ to automate login
| and article retrieval.
|
| I wouldn't be surprised. IIRC, the whole thing is privately
| funded by one individual, who must have a lot of money to
| spare.
| Stagnant wrote:
| I don't think anyone knows who runs archive.is. I've tried
| looking into it a couple of times in the past but there is
| surprisingly little information to be found. It must cost
| thousands if not tens of thousands a month to host all that
| data and AFAIK they do not monetize it in any way. From what
| I gather it probably is some Russian person as there were
| some old Stack Overflow conversations regarding the site that
| led to an empty GitHub account with a Russian name. Also
| back in 2015 the site owner blocked all Finnish ip addresses
| due to "an incident at the border"[1]. Finnish IPs have since
| been unblocked. It appears the site owner somehow thought he
| could end up in EU wide blacklist which seemed like very
| conspiratorial thinking from him.
|
| 1: https://archive.is/Pum1p
| Swiftness6022 wrote:
| [dead]
| Hamuko wrote:
| Would it be possible to check if archive.is is logged into a
| newspaper site by archiving one of the user management pages?
| lcnPylGDnU4H9OF wrote:
| I think it's a browser extension that people who have access to
| the article use to send the article data to the archive server.
| flerovium wrote:
| Can you explain? Who has purchased the subscription? I'm sure
| there's a no-redistribution clause in the subscription
| agreement.
| lcnPylGDnU4H9OF wrote:
| The person who installed the browser extension would be
| paying the subscription and ignoring said clause.
| riku_iki wrote:
| curious if eventually companies will start watermarking
| articles and catch and sue extension users.
| lcnPylGDnU4H9OF wrote:
| I suspect most content publishers would go to the source.
| If there are people who are already willing to pay for
| subscriptions and ignore the terms of those
| subscriptions, it's not much of a stretch that they'll
| ignore the fact that they got their subscription
| cancelled once (or twice, or however many times). The
| publisher would more likely see results taking legal
| action against the archivist.
| dwater wrote:
| It didn't stop the RIAA from suing loads of people over
| downloading mp3s in the past 2 decades, claiming damages
| of thousands of dollars per song the individual
| downloaded.
| riku_iki wrote:
| in this case (archive.is) they have a stronger case, since many
| people who could potentially buy a subscription read it on
| archive.is because an extension user violated the terms of the
| subscription.
|
| Also, the extension likely has terms of use prohibiting the
| upload of copyrighted content, shifting liability onto users.
| flerovium wrote:
| But what is the relationship between archive.is and the
| user who installed the extension?
| lcnPylGDnU4H9OF wrote:
| They (archive.is) would have built the extension to send
| the current page content to their servers and the user
| would have installed it so they can archive internet
| pages.
| https://help.archive.org/help/save-pages-in-the-wayback-mach... (item 2)
| Stagnant wrote:
| You are confusing archive.is with archive.org. Although
| archive.is does have an extension[1] it doesn't appear to
| capture any of the page contents, it just simply sends
| the url for archive.is to crawl.
|
| 1: https://chrome.google.com/webstore/detail/archive-page/gcaim...
| lcnPylGDnU4H9OF wrote:
| I wasn't exactly confusing them but yeah, I did link to
| an archive.org article. I was having difficulty finding
| something specific to archive.is.
|
| I think the distinction between the two is moot in this
| post. The question could very well have been "How does
| archive.org bypass paywalls?" Though it's interesting
| that archive.is seems to just crawl the URL. Indeed that
| means they wouldn't necessarily be able to bypass the
| paywall.
| inconceivable wrote:
| dude... haha it's a random person on the internet who is
| doing it for free.
| phneutral26 wrote:
| The user helps free the Internet by using archive.is as
| an openly accessible backup platform.
| AlbertCory wrote:
| That is how RECAP works ("Pacer" spelled backwards).
|
| In that case, the government is fine with it.
| wolverine876 wrote:
| I think that's how Sci-hub works, at least at some time in
| the past.
| Yujf wrote:
| I don't know about archive.is, but 12ft.io does identify as
| google to bypass paywalls afaik
| strunz wrote:
| 12ft.io also doesn't work or is disabled for many sites that
| archive.is still works on
| riffic wrote:
| your browser usually downloads an entire article and certain
| elements are overlaid.
|
| it's trivial to bypass most paywalls isn't it?
| aidenn0 wrote:
| Not for some (I think the Wall Street Journal). Apparently the
| AMP version of the page _does_ work this way for WSJ though,
| which is how IA gets around the paywall.
| mr-pink wrote:
| every time you visit they force some kid in a third world country
| to answer captchas until they can pay for one article's worth of
| content
| chrisco255 wrote:
| Just archived a website I created. It looks like it runs HTTP
| requests from a server to pull the HTML, JS and image files (it
| shows the individual requests completing before the archival
| process is complete). It must then snapshot the rendered output,
| then it renders those assets served from their domain. Buttons on
| my site don't work after the snapshot, since the scripts were
| stripped.
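A rough Python sketch of what "scripts were stripped" could look like for a snapshot tool. This is not archive.is's actual pipeline, which isn't public; `ScriptStripper` and `strip_scripts` are hypothetical names, and the sketch uses only the standard-library `html.parser`. It also drops comments and doctype declarations for brevity.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and their contents."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out = []
        self.script_depth = 0  # > 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1
        elif self.script_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped.
        if tag != "script" and self.script_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.script_depth = max(0, self.script_depth - 1)
        elif self.script_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.script_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.script_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.script_depth == 0:
            self.out.append(f"&#{name};")


def strip_scripts(html: str) -> str:
    """Return the markup with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Inline attributes such as `onclick` survive this pass, but the functions they call are gone, which matches the observation that buttons stop working in the snapshot.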
| strunz wrote:
| Your missing the point of "how does it bypass firewalls"
| wackget wrote:
| paywalls*
| sshine wrote:
| You're*
| w1nst0nsm1th wrote:
| If the people who know that tell you, they could lose access to
| said resources.
|
| But it's kind of an open secret, you just don't look in the right
| place.
| jrochkind1 wrote:
| Every once in a while I _do_ get a retrieval from archive.is that
| has the paywall intact.
|
| But I don't know the answer either.
| jwildeboer wrote:
| It's internet magic. <rainbowmagicsparkles.gif> ;)
| Miner49er wrote:
| According to their blog they use AMP:
| https://blog.archive.today/post/675805841411178496/how-does-...
| JohnFen wrote:
| Wow, an actually good use for Amp? Amazing.
| flerovium wrote:
| This explanation is incomplete. Counterexample:
|
| Amp pages are paywalled:
|
| https://www.wsj.com/articles/freeze-or-cut-spending-fight-is...
| https://amp.wsj.com/articles/freeze-or-cut-spending-fight-is...
|
| archive.is isn't: https://archive.md/LaiOX
| Deathmax wrote:
| For WSJ at least, it appears that archive.is fetches the AMP
| page, which returns the full article content hidden with CSS,
| and then modifies the page to unhide the paywalled content and
| hide ads.
|
| It might be using other techniques as well for bypassing
| paywalls, be it referer/user-agent spoofing (some old
| archives of sites that echo back HTTP request headers have
| archive.is sending a Referer of google.co.uk).
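To illustrate the "full content hidden with CSS" point: text inside an element styled `display:none` is still present in the markup the server sent. The sketch below uses only Python's standard-library `html.parser` on a made-up snippet; `TextCollector` and `all_text` are hypothetical names, and it says nothing about how any particular site or archive.is actually works.

```python
from html.parser import HTMLParser
from typing import List


class TextCollector(HTMLParser):
    """Collect every non-empty text node, regardless of CSS styling."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def all_text(html: str) -> List[str]:
    """Return the text nodes found in the markup, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

CSS only affects what a browser renders, so a parser that ignores styling sees both the visible teaser and the hidden remainder.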
| Reventlov wrote:
| I can access the wsj article without any account using
| https://gitlab.com/magnolia1234/bypass-paywalls-firefox-clea...
| (bypass paywall clean)
| alex_young wrote:
| Not specifically related to archive.is, but news sites have a
| tightrope to walk.
|
| They need to both allow the full content of their articles to be
| accessed by crawlers so they can show up in search results, but
| they also want to restrict access via paywalls. They use 2 main
| methods to achieve this: javascript DOM manipulation and IP
| address rate limiting.
|
| Conceivably one could build a system which directly accesses a
| given document one time from a unique IP address and then caches
| the HTML version of the page for further serving.
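The fetch-once-then-cache idea can be sketched in a few lines of Python. `SnapshotCache` and its `fetch` callable are hypothetical; the sketch models only the caching half and leaves out the "unique IP address" part (proxy or address rotation) entirely.

```python
from typing import Callable, Dict


class SnapshotCache:
    """Fetch each URL at most once and serve later requests from memory."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # hypothetical fetcher: URL -> HTML string
        self._store: Dict[str, str] = {}
        self.origin_hits = 0  # how many times the origin site was contacted

    def get(self, url: str) -> str:
        if url not in self._store:
            # First request for this URL: go to the origin exactly once.
            self.origin_hits += 1
            self._store[url] = self._fetch(url)
        return self._store[url]
```

Because every reader after the first is served from the cache, the origin sees a single request per document, so per-IP rate limiting never comes into play.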
| janejeon wrote:
| > If it identifies itself as archive.is, then other people could
| identify themselves the same way.
|
| _Theoretically_, they could just publish the list of IP ranges
| that canonically "belong" to archive.is. That would allow
| websites to distinguish if a request identifying itself as
| archive.is is _actually_ from them (it fits one of the IP
| ranges), or is a fraudster.
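On the website side, the IP-range check could look roughly like this in Python. The ranges below are reserved documentation blocks standing in for a hypothetical published list, not real archive.is addresses, and `is_from_published_range` is an illustrative name.

```python
import ipaddress

# Hypothetical published ranges (reserved documentation blocks, not real crawler IPs).
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_from_published_range(remote_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True if the connecting address falls inside a published range."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IPv4 or IPv6 address.
        return False
    return any(addr in net for net in ranges)
```

The check must use the address of the actual TCP connection; a value copied from a client-supplied header could be forged as easily as the User-Agent.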
| flerovium wrote:
| In theory, this might work. But is it true? Do lots of sites
| have an archive.is whitelist?
| w1nst0nsm1th wrote:
| Follow the magnolia trail...
| arbitrage wrote:
| I really don't see why they would, if they're using a paywall
| in the first place.
| lazzlazzlazz wrote:
| It would be far better and more secure for archive.is to
| publish a public key on its site and then sign requests with
| its private key, which sites could optionally verify.
| facile wrote:
| +1 on this!
| w1nst0nsm1th wrote:
| Follow the magnolia trail...
___________________________________________________________________
(page generated 2023-05-24 23:01 UTC)