[HN Gopher] Ask HN: How does archive.is bypass paywalls?
       ___________________________________________________________________
        
       Ask HN: How does archive.is bypass paywalls?
        
       If it simply visits sites, it will face a paywall too. If it
       identifies itself as archive.is, then other people could identify
       themselves the same way.
        
       Author : flerovium
       Score  : 67 points
       Date   : 2023-05-24 16:56 UTC (6 hours ago)
        
       | xiekomb wrote:
       | I thought they used this browser extension:
       | https://gitlab.com/magnolia1234/bypass-paywalls-chrome-clean
        
         | flerovium wrote:
         | That extension does work, but do we know they use it?
        
           | marcod wrote:
           | They don't always use it, because I can archive a new page
           | from my mobile phone browser, which doesn't even support
           | extensions.
           | 
           | My guess is that most content providers with paywalls serve
           | the entire content, so search engines can pick it up, and
           | then use scripts to raise the paywall - archive.is takes
           | their snapshot before that happens / doesn't trigger those
           | scripts.
        
         | DrDentz wrote:
         | It's actually the opposite, for some news sites this extension
         | links to archive.is because that's the only known way to bypass
         | the paywall.
        
       | retrocryptid wrote:
       | Many (most?) "big content" sites let Google and Bing spiders
       | scrape the contents of articles so when people search for terms
       | in the article they'll find a hit and then get referred to the
       | pay wall.
       | 
       | Google doesn't want everyone to know what a Google indexing
       | request looks like for fear the CEO mafia will institute
       | shenanigans. And the content providers (NYT, WaPo, etc.) don't
       | want people to know 'cause they don't want people evading their
       | paywall.
       | 
       | Or maybe they're okay with letting the archive index their
       | content...
        
         | Atlas22 wrote:
         | Just FYI google and bing publish their user agent strings[1][2]
         | for the crawlers. At least in my experience most of the typical
         | ad-infested and paywalled news sites won't display the paywall
         | if you change the user agent to a crawler they prefer.
         | 
         | [1] https://developers.google.com/search/docs/crawling-indexing/...
         | [2] https://www.bing.com/webmasters/help/which-crawlers-does-bin...
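
A rough sketch of the server-side behavior described above, assuming a site gates its paywall on nothing more than a User-Agent substring match. The `shows_paywall` function and the `CRAWLER_TOKENS` tuple are hypothetical names; the two crawler strings follow the commonly published Googlebot and bingbot user agents. Real sites may check considerably more than this.

```python
# Hypothetical server-side gate; an assumption for illustration only.
CRAWLER_TOKENS = ("Googlebot", "bingbot")  # product tokens in the published UA strings


def shows_paywall(user_agent: str) -> bool:
    """Return True if this naive gate would show the paywall to the client."""
    ua = user_agent.lower()
    # Any UA containing a crawler token is treated as a search crawler.
    return not any(token.lower() in ua for token in CRAWLER_TOKENS)


browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0"
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
bingbot_ua = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

for ua in (browser_ua, googlebot_ua, bingbot_ua):
    print(shows_paywall(ua), ua)
```

Since the User-Agent header is entirely client-controlled, a gate like this cannot tell a real crawler from anyone who copies the string, which would match the experience reported above.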
        
         | wolverine876 wrote:
         | Doesn't almost every site on the web know exactly what the
         | Google bot looks like?
        
         | peter422 wrote:
         | Google gives precise details about how to verify their bot is
         | crawling your site and how to denote what content is paywalled
         | and what isn't.
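
Google's documented verification is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. Below is a minimal sketch of that check, assuming IPv4 only. `is_verified_googlebot`, the injectable resolver parameters, and the suffix list are names and choices made up here so the logic can be exercised without network access; the accepted domains should be checked against Google's current documentation.

```python
import socket

# Assumed hostname suffixes; confirm against Google's current crawler docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    # Default resolvers use the system DNS; tests can inject fakes instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, or the PTR record is spoofed.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False


# Offline demonstration with fake DNS tables (all values are illustrative).
fake_ptr = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com",
            "203.0.113.9": "crawl-66-249-66-1.googlebot.com",
            "198.51.100.7": "host.example.com"}
fake_a = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}

for ip in fake_ptr:
    print(ip, is_verified_googlebot(ip, fake_ptr.__getitem__, fake_a.__getitem__))
```

Unlike a User-Agent match, this ties the request to DNS records that only the crawler's operator controls.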
        
       | not_your_vase wrote:
       | They use you, as a proxy. If you (who archives it) have access to
       | the site (either because you paid or have free articles), they
       | can archive it too. If you don't have access, they only archive a
       | paywall.
        
       | RicoElectrico wrote:
       | Nice try, media company employee ;)
       | 
       | /jk
        
         | PTOB wrote:
         | My sentiments exactly.
        
       | jakedata wrote:
       | Alas it doesn't allow access to the comment section of the WSJ
       | which is the only reason I would visit the site. WSJ comments
       | reinforce my opinion of the majority of humanity. My father allowed
       | his subscription to lapse and I won't send them my money so I
       | will just have to imagine it.
        
       | fxtentacle wrote:
       | Probably the people who operate archive.is just purchased
       | subscriptions for the most common newspaper sites. And then they
       | can use something like https://pptr.dev/ to automate login and
       | article retrieval.
       | 
       | I guess the business model is to inject their ads into someone
       | else's content, so kinda like Facebook. That would also surely
       | generate more money from the ads than the cost of subscribing to
       | multiple newspapers.
        
         | panopticon wrote:
         | > _Probably the people who operate archive.is just purchased
         | subscriptions for the most common newspaper sites. And then
         | they can use something like https://pptr.dev/ to automate login
         | and article retrieval._
         | 
         | I would expect to see login information rather than "Sign In"
         | and "Subscribe" buttons on archived articles then. Unless
         | they're stripping that from the archive?
        
         | tivert wrote:
         | > Probably the people who operate archive.is just purchased
         | subscriptions for the most common newspaper sites. And then
         | they can use something like https://pptr.dev/ to automate login
         | and article retrieval.
         | 
         | I wouldn't be surprised. IIRC, the whole thing is privately
         | funded by one individual, who must have a lot of money to
         | spare.
        
           | Stagnant wrote:
           | I don't think anyone knows who runs archive.is. I've tried
           | looking into it a couple of times in the past but there is
           | surprisingly little information to be found. It must cost
           | thousands if not tens of thousands a month to host all that
           | data and AFAIK they do not monetize it in any way. From what
           | I gather it probably is some Russian person as there were
           | some old Stack Overflow conversations regarding the site that
           | led to an empty GitHub account with a Russian name. Also
           | back in 2015 the site owner blocked all Finnish IP addresses
           | due to "an incident at the border"[1]. Finnish IPs have since
           | been unblocked. It appears the site owner somehow thought he
           | could end up in EU wide blacklist which seemed like very
           | conspiratorial thinking from him.
           | 
           | 1: https://archive.is/Pum1p
        
         | Hamuko wrote:
         | Would it be possible to check if archive.is is logged into a
         | newspaper site by archiving one of the user management pages?
        
       | lcnPylGDnU4H9OF wrote:
       | I think it's a browser extension that people who have access to
       | the article use to send the article data to the archive server.
        
         | flerovium wrote:
         | Can you explain? Who has purchased the subscription? I'm sure
         | there's a no-redistribution clause in the subscription
         | agreement.
        
           | lcnPylGDnU4H9OF wrote:
           | The person who installed the browser extension would be
           | paying the subscription and ignoring said clause.
        
             | riku_iki wrote:
             | curious if eventually companies will start watermarking
             | articles and catch and sue extension users.
        
               | lcnPylGDnU4H9OF wrote:
               | I suspect most content publishers would go to the source.
               | If there are people who are already willing to pay for
               | subscriptions and ignore the terms of those
               | subscriptions, it's not much of a stretch that they'll
               | ignore the fact that they got their subscription
               | cancelled once (or twice, or however many times). The
               | publisher would more likely see results taking legal
               | action against the archivist.
        
               | dwater wrote:
               | It didn't stop the RIAA from suing loads of people over
               | downloading mp3s in the past 2 decades, claiming damages
               | of thousands of dollars per song the individual
               | downloaded.
        
               | riku_iki wrote:
               | in this case (archive.is) they have a stronger case, since
               | many people who could potentially buy a subscription read
               | it on archive.is because an extension user violated the
               | terms of subscription.
               | 
               | Also, the extension likely has terms of usage prohibiting
               | the upload of copyrighted content, shifting liability
               | onto users.
        
             | flerovium wrote:
             | But what is the relationship between archive.is and the
             | user who installed the extension?
        
               | lcnPylGDnU4H9OF wrote:
               | They (archive.is) would have built the extension to send
               | the current page content to their servers and the user
               | would have installed it so they can archive internet
               | pages.
               | https://help.archive.org/help/save-pages-in-the-wayback-mach...
               | (item 2)
        
               | Stagnant wrote:
               | You are confusing archive.is with archive.org. Although
               | archive.is does have an extension[1] it doesn't appear to
               | capture any of the page contents, it just simply sends
               | the url for archive.is to crawl.
               | 
               | 1: https://chrome.google.com/webstore/detail/archive-page/gcaim...
        
               | lcnPylGDnU4H9OF wrote:
               | I wasn't exactly confusing them but yeah, I did link to
               | an archive.org article. I was having difficulty finding
               | something specific to archive.is.
               | 
               | I think the distinction between the two is moot in this
               | post. The question could very well have been "How does
               | archive.org bypass paywalls?" Though it's interesting
               | that archive.is seems to just crawl the URL. Indeed that
               | means they wouldn't necessarily be able to bypass the
               | paywall.
        
               | inconceivable wrote:
               | dude... haha it's a random person on the internet who is
               | doing it for free.
        
               | phneutral26 wrote:
               | The user helps free the Internet by using archive.is as
               | an openly accessible backup platform.
        
         | AlbertCory wrote:
         | That is how RECAP works ("Pacer" spelled backwards).
         | 
         | In that case, the government is fine with it.
        
           | wolverine876 wrote:
           | I think that's how Sci-hub works, at least at some time in
           | the past.
        
       | Yujf wrote:
       | I don't know about archive.is, but 12ft.io does identify as
       | google to bypass paywalls afaik
        
         | strunz wrote:
         | 12ft.io also doesn't work or is disabled for many sites that
         | archive.is still works on
        
       | riffic wrote:
       | your browser usually downloads an entire article and certain
       | elements are overlaid.
       | 
       | it's trivial to bypass most paywalls isn't it?
        
         | aidenn0 wrote:
         | Not for some (I think the Wall Street Journal). Apparently the
         | AMP version of the page _does_ work this way for WSJ though,
         | which is how IA gets around the paywall.
        
       | mr-pink wrote:
       | every time you visit they force some kid in a third world country
       | to answer captchas until they can pay for one article's worth of
       | content
        
       | chrisco255 wrote:
       | Just archived a website I created. It looks like it runs HTTP
       | requests from a server to pull the HTML, JS and image files (it
       | shows the individual requests completing before the archival
       | process is complete). It must then snapshot the rendered output,
       | then it renders those assets served from their domain. Buttons on
       | my site don't work after the snapshot, since the scripts were
       | stripped.
        
         | strunz wrote:
         | Your missing the point of "how does it bypass firewalls"
        
           | wackget wrote:
           | paywalls*
        
           | sshine wrote:
           | You're*
        
       | w1nst0nsm1th wrote:
       | If the people who know that tell you, they could lose access to
       | said resources.
       | 
       | But it's kind of an open secret, you just don't look in the right
       | place.
        
       | jrochkind1 wrote:
       | Every once in a while I _do_ get a retrieval from archive.is that
       | has the paywall intact.
       | 
       | But I don't know the answer either.
        
       | jwildeboer wrote:
       | It's internet magic. <rainbowmagicsparkles.gif> ;)
        
       | Miner49er wrote:
       | According to their blog they use AMP:
       | https://blog.archive.today/post/675805841411178496/how-does-...
        
         | JohnFen wrote:
         | Wow, an actually good use for Amp? Amazing.
        
         | flerovium wrote:
         | This explanation is incomplete. Counterexample:
         | 
         | Amp pages are paywalled:
         | 
         | https://www.wsj.com/articles/freeze-or-cut-spending-fight-is...
         | https://amp.wsj.com/articles/freeze-or-cut-spending-fight-is...
         | 
         | archive.is isn't: https://archive.md/LaiOX
        
           | Deathmax wrote:
           | For WSJ at least, it appears that archive.is is fetching the
           | AMP page, which returns the full content of the article
           | hidden with CSS, and then modifying the page to unhide the
           | paywalled content + hide ads.
           | 
           | It might also be using other techniques for bypassing
           | paywalls, such as referer/user-agent spoofing (some old
           | archives of sites that echo back HTTP request headers have
           | archive.is sending a Referer of google.co.uk).
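
To illustrate the point about content that is served in full but hidden with CSS, here is a small sketch using only the standard library. The HTML sample, the `paywalled` class name, and the `extract_text` helper are all invented for this example and are not taken from WSJ or archive.is; the point is only that styling does not remove text from the response body.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node, ignoring any CSS styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Invented sample: the second paragraph is present but hidden via inline CSS.
SAMPLE_HTML = """
<article>
  <p>First paragraph, visible to everyone.</p>
  <div class="paywalled" style="display: none">
    <p>Second paragraph, hidden by CSS.</p>
  </div>
</article>
"""


def extract_text(html):
    """Return all text nodes found in the markup, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(extract_text(SAMPLE_HTML))
```

Any client that reads the markup rather than the rendered page sees both paragraphs, which is consistent with the AMP behavior described above.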
        
           | Reventlov wrote:
           | I can access the wsj article without any account using
           | https://gitlab.com/magnolia1234/bypass-paywalls-firefox-clea...
           | (bypass paywall clean)
        
       | alex_young wrote:
       | Not specifically related to archive.is, but news sites have a
       | tightrope to walk.
       | 
       | They need to both allow the full content of their articles to be
       | accessed by crawlers so they can show up in search results, but
       | they also want to restrict access via paywalls. They use 2 main
       | methods to achieve this: javascript DOM manipulation and IP
       | address rate limiting.
       | 
       | Conceivably one could build a system which directly accesses a
       | given document one time from a unique IP address and then cache
       | the HTML version of the page for further serving.
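
A toy model of the IP-based metering mentioned above, seen from the publisher's side. The `MeteredPaywall` class, its `free_articles` limit, and the in-memory counter are assumptions for illustration; real metering presumably also involves cookies, accounts, and time windows.

```python
from collections import defaultdict


class MeteredPaywall:
    """Allow each IP a fixed number of distinct free articles (hypothetical policy)."""

    def __init__(self, free_articles=3):
        self.free_articles = free_articles
        self.seen = defaultdict(set)  # ip -> set of article URLs already served

    def allows(self, ip, article_url):
        viewed = self.seen[ip]
        if article_url in viewed:
            return True  # re-reading an already counted article is free
        if len(viewed) < self.free_articles:
            viewed.add(article_url)
            return True
        return False  # quota used up: this request would get the paywall


meter = MeteredPaywall(free_articles=2)
for ip, url in [("203.0.113.5", "/a"), ("203.0.113.5", "/b"),
                ("203.0.113.5", "/c"), ("198.51.100.8", "/c")]:
    print(ip, url, meter.allows(ip, url))
```

Under a policy like this, the first request from a previously unseen address always gets the full article, which is why a single fetch followed by caching would be enough.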
        
       | janejeon wrote:
       | > If it identifies itself as archive.is, then other people could
       | identify themselves the same way.
       | 
       | _Theoretically_, they could just publish the list of IP ranges
       | that canonically "belong" to archive.is. That would allow
       | websites to distinguish if a request identifying itself as
       | archive.is is _actually_ from them (it fits one of the IP
       | ranges), or is a fraudster.
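
The range check suggested above is easy to sketch with the standard `ipaddress` module. The ranges below are reserved documentation prefixes standing in for a hypothetical published list; nothing in this thread indicates that archive.is publishes one, and `is_from_published_range` is a made-up helper name.

```python
import ipaddress

# Placeholder ranges (RFC 5737 / RFC 3849 documentation prefixes), NOT real
# archive.is addresses; a real deployment would load the published list.
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_from_published_range(remote_ip, ranges=PUBLISHED_RANGES):
    """Return True if the connecting address falls inside any published range."""
    addr = ipaddress.ip_address(remote_ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    return any(addr in network for network in ranges)


for ip in ("192.0.2.17", "198.51.100.1", "2001:db8::1"):
    print(ip, is_from_published_range(ip))
```

The check has to use the TCP peer address rather than a client-supplied header such as `X-Forwarded-For`, otherwise it is as spoofable as the User-Agent.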
        
         | flerovium wrote:
         | In theory, this might work. But is it true? Do lots of sites
         | have an archive.is whitelist?
        
           | w1nst0nsm1th wrote:
           | Follow the magnolia trail...
        
           | arbitrage wrote:
           | I really don't see why they would, if they're using a paywall
           | in the first place.
        
         | lazzlazzlazz wrote:
         | It would be far better and more secure for archive.is to
         | publish a public key on its site and then sign requests with
         | its private key, which sites could optionally verify.
        
           | facile wrote:
           | +1 on this!
        
       | w1nst0nsm1th wrote:
       | Follow the magnolia trail...
        
       ___________________________________________________________________
       (page generated 2023-05-24 23:01 UTC)