[HN Gopher] Perplexity is using stealth, undeclared crawlers to ...
       ___________________________________________________________________
        
       Perplexity is using stealth, undeclared crawlers to evade no-crawl
       directives
        
       Author : rrampage
       Score  : 886 points
       Date   : 2025-08-04 13:39 UTC (9 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | gruez wrote:
       | >We conducted an experiment by querying Perplexity AI with
       | questions about these domains, and discovered Perplexity was
       | still providing detailed information regarding the exact content
       | hosted on each of these restricted domains
       | 
        | That's... less conclusive than I'd like to see, especially for a
        | content marketing article calling out a specific company.
        | Specifically, it's unclear whether Perplexity was crawling
        | (i.e. systematically viewing every page on the site without the
        | direction of a human) or simply retrieving content
       | on behalf of the user. I think most people would draw a
       | distinction between the two, and would at least agree the latter
       | is more acceptable than the former.
        
         | thoroughburro wrote:
         | > I think most people would draw a distinction between the two,
         | and would at least agree the latter is more acceptable than the
         | former.
         | 
         | No. I should be able to control which automated retrieval tools
         | can scrape my site, regardless of who commands it.
         | 
         | We can play cat and mouse all day, but I control the content
         | and I will always win: I can just take it down when annoyed
         | badly enough. Then nobody gets the content, and we can all
         | thank upstanding companies like Perplexity for that collapse of
         | trust.
        
           | gkbrk wrote:
           | > Then nobody gets the content, and we can all thank
           | upstanding companies like Perplexity for that collapse of
           | trust.
           | 
           | But they didn't take down the content, you did. When people
           | running websites take down content because people use Firefox
            | with ad-blockers, I don't blame Firefox either; I blame the
            | website.
        
             | Bluescreenbuddy wrote:
             | FF isn't training their money printer with MY data. AI
             | scrapers are
        
             | glenstein wrote:
             | >But they didn't take down the content, you did.
             | 
             | That skips the part about one party's unique role in the
             | abuse of trust.
        
           | hombre_fatal wrote:
           | Taking down the content because you're annoyed that people
           | are asking questions about it via an LLM interface doesn't
           | seem like you're winning.
           | 
           | It's also a gift to your competitors.
           | 
           | You're certainly free to do it. It's just a really faint
           | example of you being "in control" much less winning over LLM
           | agents: Ok, so the people who cared about your content can't
           | access it anymore because you "got back" at Perplexity, a
           | company who will never notice.
        
             | ipaddr wrote:
              | In my case, my server kept going down because LLM agents
              | kept requesting pages from my lyrics site. Removing that
              | site allowed the other sites to remain up. True story.
             | 
              | Who cares if Perplexity never notices, or if competitors
              | get an advantage? It's a negative both for users of
              | Perplexity and for direct visitors, because the content
              | doesn't exist anymore.
             | 
             | That's the world perplexity and others are creating. They
             | will be able to pull anything from the web but nothing will
             | be left.
        
           | IncreasePosts wrote:
           | You don't win, because presumably you were providing the
           | content for some reason, and forcing yourself to take it down
           | is contrary to whatever reason that was in the first place.
        
             | ipaddr wrote:
              | LLMs hammer certain topics, so removing one site allows
              | the others on the same server to stay up.
        
           | Den_VR wrote:
            | You can limit access, sure: with ACLs, putting content behind
            | a login, certificate-based mechanisms, and, at the end of the
            | day, a power cord.
           | 
           | But really, controlling which automated retrieval tools are
           | allowed has always been more of a code of honor than a
           | technical control. And that trust you mention has always been
           | broken. For as long as I can remember anyway. Remember
           | LexiBot and AltaVista?
        
         | fluidcruft wrote:
         | If the AI archives/caches all the results it accesses and
         | enough people use it, doesn't it become a scraper? Just learn
         | off the cached data. Being the man-in-the-middle seems like a
         | pretty easy way to scrape salient content while also getting
         | signals about that content's value.
        
           | JimDabell wrote:
            | No. The key difference is that if a user asks about a
            | specific page, then when Perplexity fetches that page it is
            | acting as a user agent operated by a human, not as a crawler.
            | It doesn't matter how many times this happens or what they do
            | with the result. If they aren't recursively fetching pages,
            | then they aren't a crawler and robots.txt does not apply to
            | them. robots.txt is not a generic access control mechanism;
            | it is designed _solely_ for automated clients.
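That per-agent scoping is visible in how robots.txt is actually consumed: well-behaved automated clients check it against their own declared name before fetching, while interactive browsers never consult it at all. A minimal sketch using Python's stdlib parser (the agent name and rules here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared crawler, allow everyone else.
rules = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The declared crawler is denied...
print(parser.can_fetch("ExampleBot", "https://example.com/page"))  # False
# ...but the rules only bind clients that identify themselves and choose
# to check; a human driving a browser never consults this file.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/page"))  # True
```

The enforcement is entirely on the client side, which is exactly why the "fetcher acting for a human" vs. "crawler" distinction matters in this thread.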
        
             | sbarre wrote:
             | I would only agree with this if we knew for sure that these
             | on-demand human-initiated crawls didn't result in the
             | crawled page being added to an overall index and scheduled
             | for future automated crawls.
             | 
             | Otherwise it's just adding an unwilling website to a crawl
             | index, and showing the result of the first crawl as a
             | byproduct of that action.
        
             | fluidcruft wrote:
             | Many people don't want their data used for free/any
             | training. AI developers have been so repeatedly unethical
              | that the well-earned Bayesian prior assigns high
              | probability to AI developers crossing the
              | training/inference streams.
        
               | JimDabell wrote:
               | > Many people don't want their data used for free/any
               | training.
               | 
               | That is true. But robots.txt is not designed to give them
               | the ability to prevent this.
        
               | gunalx wrote:
                | It is in the name: rules for the robots. Any scraper, AI
                | or not, whether mass-recursive or single-page, should
                | abide by the rules.
        
             | glenstein wrote:
             | > It doesn't matter how many times this happens or what
             | they do with the result.
             | 
             | That's where you lost me, as this is key to GP's point
             | above and it takes more than a mere out-of-left-field
             | declaration that "it doesn't matter" to settle the question
             | of whether it matters.
             | 
             | I think they raised an important point about using cached
             | data to support functions beyond the scope of simple at-
             | request page retrieval.
        
           | gruez wrote:
           | >If the AI archives/caches all the results it accesses and
           | enough people use it, doesn't it become a scraper?
           | 
           | That's basically how many crowdsourced crawling/archive
           | projects work. For instance, sci-hub and RECAP[1]. Do you
           | think they should be shut down as well? In both cases there's
           | even a stronger justification to shutting them down, because
           | the original content is paywalled and you could plausibly
           | argue there's lost revenue on the line.
           | 
           | [1] https://en.wikipedia.org/wiki/Free_Law_Project#RECAP
        
             | fluidcruft wrote:
             | I didn't suggest Perplexity should be shut down, though.
             | And yes, in your analogy sites are completely justified to
             | take whatever actions they can to block people who are
             | building those caches.
        
         | a2128 wrote:
         | In theory retrieving a page on behalf of a user would be
         | acceptable, but these are AI companies who have disregarded all
          | norms surrounding copyright, etc. It would be stupid of them
          | not to also save the contents of the page and use it for
          | future AI training or further crawling.
        
           | zarzavat wrote:
           | If you allow Googlebot to crawl your website and train
           | Gemini, but you don't allow smaller AI companies to do the
           | same thing, then you're contributing to Google's hegemony.
           | Given that AI is likely to be an increasingly important part
           | of society in the future, that kind of discrimination is
           | anti-social. I don't want a future where everything is run by
           | Google even more than it currently is.
           | 
           | Crawling is legal. Training is presumably legal. Long may the
           | little guys do both.
        
             | dgreensp wrote:
             | Googlebot respects robots.txt. And Google doesn't use the
             | fetched data from users of Chrome to supplement their
             | search index (as a2128 is speculating that Perplexity might
             | do when they fetch pages on the user's behalf).
        
               | foota wrote:
               | Yes, but there's no way to say "allow indexing for
               | search, but not for AI use", right?
        
               | warkdarrior wrote:
                | But there is:
                | https://developers.google.com/search/docs/crawling-indexing/...
                | 
                | There is a user agent for search that you can control in
                | robots.txt:
                | 
                |     user-agent: Googlebot
                | 
                | There is another user agent for AI training:
                | 
                |     user-agent: Google-Extended
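Put together, a robots.txt that lets Google index a site for search while opting out of AI training could look like this (a sketch; both product tokens are real documented Google tokens, the blanket rules are illustrative):

```
# Allow Google Search crawling
User-agent: Googlebot
Allow: /

# Opt the whole site out of use for Gemini/AI training
User-agent: Google-Extended
Disallow: /
```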
        
         | throwanem wrote:
         | The HTTP spec draws such a distinction, albeit implicitly, in
         | the form (and name) of its concept of "user agent."
        
           | alexey-salmin wrote:
           | Over time it degraded into declaring compatibility with a
           | bunch of different browser engines and doesn't reflect the
           | actual agent anymore.
           | 
           | And very likely Perplexity is in fact using a Chrome-
           | compatible engine to render the page.
        
             | throwanem wrote:
             | The header to which you refer was named for the concept.
        
         | busymom0 wrote:
          | The examples the article cites suggest to me that Perplexity
          | is merely retrieving content on behalf of the user. I do not
          | see a problem with this.
        
       | fxtentacle wrote:
       | I find this problem quite difficult to solve:
       | 
       | 1. If I as a human request a website, then I should be shown the
       | content. Everyone agrees.
       | 
       | 2. If I as the human request the software on my computer to
       | modify the content before displaying it, for example by
       | installing an ad-blocker into my user agent, then that's my
       | choice and the website should not be notified about it. Most
       | users agree, some websites try to nag you into modifying the
       | software you run locally.
       | 
       | 3. If I now go one step further and use an LLM to summarize
       | content because the authentic presentation is so riddled with
       | ads, JavaScript, and pop-ups, that the content becomes borderline
        | unusable, then why would the LLM accessing the website on my
        | behalf be in a different legal category than my Firefox web
        | browser accessing the website on my behalf?
        
         | Beijinger wrote:
         | How about I open a proxy, replace all ads with my ads, redirect
         | the content to you and we share the ad revenue?
        
           | fxtentacle wrote:
           | That's somewhat antisocial, but perfectly legal in the US.
           | It's called PayPal Honey, for example, and has been running
           | for 13 years now.
        
             | rustc wrote:
             | Since when does PayPal Honey replace ads on websites?
             | 
             | > PayPal Honey is a browser extension that automatically
             | finds and applies coupon codes at checkout with a single
             | click.
        
           | carlosjobim wrote:
           | That's the Brave browser.
        
         | zeta0134 wrote:
         | If the LLM were running this sort of thing at the user's
         | explicit request this would be fine. The problem is training.
         | Every AI startup on the planet right now is aggressively
         | crawling everything that will let them crawl. The server isn't
         | seeing occasional summaries from interested users, but
         | thousands upon thousands of bots repeatedly requesting every
         | link they can find as fast as they can.
        
           | mnmalst wrote:
            | But that's not what this article is about. From what I
            | understand, this article is about a user requesting
            | information about a specific domain, not general scraping.
        
           | fxtentacle wrote:
           | Then what if I ask the LLM 10 questions about the same domain
           | and ask it to research further? Any human would then click
           | through 50 - 100 articles to make sure they know what that
            | domain contains. If that part is automated by using an LLM,
            | does that change the legal situation? How many page URLs do
            | you think one should be allowed to access per LLM prompt?
        
             | zeta0134 wrote:
             | All of them. That's at the explicit request of the user.
             | I'm not sure where the downvotes are coming from, since I
             | agree with all of these points. The training thing has
             | merely _pissed off_ lots of server operators already, so
              | they quite reasonably tend to block first and ask questions
              | later. I think that's important context.
        
           | hombre_fatal wrote:
           | TFA isn't talking about crawling to harvest training data.
           | 
            | It's talking about Perplexity crawling sites on demand in
            | response to user queries, and complaining that no, that's
            | not fine either; hence this thread.
        
             | cjonas wrote:
             | Doesn't perplexity crawl to harvest and index data like a
             | traditional search engine? Or is it all "on demand"?
        
               | lukeschlather wrote:
               | For the most part I would assume they pay for access to
               | Google or Bing's index. I also assume they don't really
               | train models. So all their "crawling" is on behalf of
               | users.
        
         | bbqfog wrote:
         | Correct, it's user hostile to dictate which software is allowed
         | to see content.
        
           | klabb3 wrote:
           | They all do it. Facebook, Reddit, Twitter, Instagram. Because
           | it interferes with their business model. It was already bad,
           | but now the conflict between business and the open web is
           | reaching unprecedented levels, especially since the copyright
           | was scrapped for AI companies.
        
         | Workaccount2 wrote:
         | >2. If I as the human request the software on my computer to
         | modify the content before displaying it, for example by
         | installing an ad-blocker into my user agent, then that's my
         | choice and the website should not be notified about it. Most
         | users agree, some websites try to nag you into modifying the
         | software you run locally.
         | 
          | If I put time and effort into a website and its content, I
         | should expect no compensation despite bearing all costs.
         | 
         | Is that something everyone would agree with?
         | 
         | The internet should be entirely behind paywalls, besides
         | content that is already provided ad free.
         | 
         | Is that something everyone would agree with?
         | 
         | I think the problem you need to be thinking about is "How can
         | the internet work if no one wants to pay anything for
         | anything?"
        
           | Bjartr wrote:
           | You're free to deny access to your site arbitrarily,
           | including for lack of compensation.
        
             | Workaccount2 wrote:
             | >and the website should not be notified about it.
        
               | giantrobot wrote:
               | My user agent and its handling of your content once it's
               | on my computer are not your concern. You don't need to
               | know if the data is parsed by a screen reader, an AI
               | agent, or just piped to /dev/null. It's simply not your
               | concern and never will be.
        
             | cjonas wrote:
              | Like for people who are using an ad blocker, or for a
              | crawler downloading your content so it can be used in an
              | AI response?
        
               | Bjartr wrote:
               | Arbitrarily, as in for any reason. It's your site, you
               | decide what constraints an incoming request must meet for
               | it to get a response containing the content of your site.
        
             | ndiddy wrote:
             | This article is about Cloudflare attempting to deny
             | Perplexity access to their demo site by blocking
             | Perplexity's declared user-agent and official IP range.
             | Perplexity responded to this denial by impersonating Google
             | Chrome on macOS and rotating through IPs not listed in
             | their published IP range to access the site anyway. This
             | means it's not just "you're free to deny access to your
             | site arbitrarily", it's "you're free to play a cat-and-
             | mouse game indefinitely where the other side is a giant
             | company with hundreds of millions of dollars in VC
             | funding".
        
               | Bjartr wrote:
               | The comment I'm responding to established a slightly
               | different context by asking a specific question about
               | getting compensation from site visitors.
        
           | nradov wrote:
           | Yes, I agree with that. If a website owner expects
           | compensation then they should use a paywall.
        
           | Chris2048 wrote:
            | If I put time and effort into a food recipe, should I get
            | compensation?
            | 
            | The answer is apparently "no", and I don't really see how
            | recipe books have suffered as a result of less gatekeeping.
            | 
            | "How will the internet work"? Probably better in some ways.
            | There is plenty of valuable content on the internet given
            | away for free; the problem is that it's being buried in
            | low-value AI slop.
        
             | Workaccount2 wrote:
             | You understand that HN is ad supported too, right?
        
               | Chris2048 wrote:
               | No, I don't.
               | 
               | But what is your point? Is the value in HN primarily in
               | its hosting, or the non-ad-supported community?
        
               | Workaccount2 wrote:
               | Outside of Wikipedia, I'm not sure what content you are
               | thinking of.
               | 
                | Taking HN as potentially one of these places, it doesn't
                | even qualify. HN is funded entirely to be a place for
               | advertising ycombinator companies to a large crowd of
               | developers. HN is literally a developer honey pot that
               | they get exclusive ad rights to.
        
         | bobbiechen wrote:
         | I like the terminology "crawler" vs. "fetcher" to distinguish
         | between mass scraping and something more targeted as a user
         | agent.
         | 
         | I've been working on AI agent detection recently (see
         | https://stytch.com/blog/introducing-is-agent/ ) and I think
         | there's genuine value in website owners being able to identify
         | AI agents to e.g. nudge them towards scoped access flows
         | instead of fully impersonating a user with no controls.
         | 
          | On the flip side, the crawlers also face a reputational risk
          | here: anyone can slap on the user agent string of a well-known
          | crawler and do bad things like ignoring robots.txt. The
          | standard solution today is a reverse DNS lookup on the IP, but
          | that's a pain for website owners too compared with just
          | aggressively blocking all unusual setups.
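The reverse-DNS check mentioned above can be sketched in a few lines. The idea: PTR-resolve the requesting IP, check that the resulting hostname falls under the crawler operator's domain, then forward-resolve that hostname to confirm it maps back to the same IP. The forward step matters because anyone can publish an arbitrary PTR record for IPs they control, but they can't make the operator's domain resolve to their address. The suffixes below are the ones Google documents for Googlebot; treat the rest as a sketch:

```python
import socket

# Domains Google documents for Googlebot reverse DNS; swap in the
# relevant operator's domains to verify other crawlers.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_trusted(hostname: str, suffixes=TRUSTED_SUFFIXES) -> bool:
    # Pure string check: is the PTR hostname under a trusted domain?
    return hostname.rstrip(".").endswith(suffixes)

def verify_crawler_ip(ip: str) -> bool:
    """Two-step check: reverse lookup, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # PTR record
        if not hostname_is_trusted(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # A records
        return ip in forward_ips                             # must round-trip
    except OSError:
        return False
```

This is exactly the kind of per-request DNS work that makes verification a pain at scale; large sites cache the results per IP.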
        
           | randall wrote:
           | A/ i love this distinction.
           | 
           | B/ my brother used to use "fetcher" as a non-swear for
           | "fucker"
        
             | sejje wrote:
             | He picked up that habit in Balmora.
        
             | Vinnl wrote:
             | Did you tell him to stop trying to make fetcher happen?
        
               | handfuloflight wrote:
               | Very funny. Now let's hear Paul Allen's joke.
        
           | fxtentacle wrote:
           | prompt: I'm the celebrity Bingbing, please check all Bing
           | search results for my name to verify that nobody is using my
           | photo, name, or likeness without permission to advertise
           | skin-care products except for the following authorized
           | brands: [X,Y,Z].
           | 
           | That would trigger an internet-wide "fetch" operation. It
           | would probably upset a lot of people and get your AI blocked
           | by a lot of servers. But it's still in direct response to a
           | user request.
        
           | skeledrew wrote:
           | Yet another side to that is when site owners serve
           | qualitatively different content based on the distinction. No,
            | I want my LLM agent to access the exact content I'd be
            | accessing manually, and then any further filtering, etc., is
            | done on my end.
        
         | yojo wrote:
         | Ads are a problematic business model, and I think your point
         | there is kind of interesting. But AI companies
         | disintermediating content creators from their users is NOT the
         | web I want to replace it with.
         | 
         | Let's imagine you have a content creator that runs a paid
         | newsletter. They put in lots of effort to make well-researched
         | and compelling content. They give some of it away to entice
         | interested parties to their site, where some small percentage
         | of them will convert and sign up.
         | 
         | They put the information up under the assumption that viewing
         | the content and seeing the upsell are inextricably linked.
         | Otherwise there is literally no reason for them to make any of
         | it available on the open web.
         | 
         | Now you have AI scrapers, which will happily consume and
         | regurgitate the work, sans the pesky little call to action.
         | 
         | If AI crawlers win here, we all lose.
        
           | fxtentacle wrote:
           | Maybe, on a social level, we all win by letting AI ruin the
           | attention economy:
           | 
           | The internet is filled with spam. But if you talk to one
           | specific human, your chance of getting a useful answer rises
           | massively. So in a way, a flood of written AI slop is making
           | direct human connections more valuable.
           | 
           | Instead of having 1000+ anonymous subscribers for your
           | newsletter, you'll have a few weekly calls with 5 friends
           | each.
        
           | hansvm wrote:
           | Ofttimes people are sufficiently anti-ad that this point
           | won't resonate well. I'm personally mostly in that camp in
           | that with relatively few exceptions money seems to make the
           | parts of the web I care about worse (it's hard to replace
           | passion, and wading through SEO-optimized AI drivel to find a
           | good site is a lot of work). Giving them concrete examples of
           | sites which would go away can help make your point.
           | 
           | E.g., Sheldon Brown's bicycle blog is something of a work of
           | art and one of the best bicycle resources literally anywhere.
           | I don't know the man, but I'd be surprised if he'd put in the
           | same effort without the "brand" behind it -- thankful readers
           | writing in, somebody occasionally using the donate button to
           | buy him a coffee, people like me talking about it here, etc.
        
             | blacksmith_tb wrote:
             | Sheldon died in 2008, but there's no doubt that all the
             | bicycling wisdom he posted lives on!
        
               | wulfstan wrote:
               | He's that widely respected that amongst those who repair
               | bikes (I maintain a fleet of ~10 for my immediate family)
               | he is simply known as "Saint Sheldon".
        
             | vertoc wrote:
              | But even your example potentially gets worse with AI: the
              | "upsell" of his blog isn't paid posts but more readers, so
              | the reward is thankful readers, a few donators, people
              | talking about it. If the only interface becomes an AI
              | summary of his work without credit, it's much more likely
              | he stops writing, as it'll seem like he's just screaming
              | into the void.
        
               | hansvm wrote:
               | I don't think we're disagreeing?
        
             | yojo wrote:
             | I agree that specific examples help, though I think the
             | ones that resonate most will necessarily be niche. As a
             | teen, I loved Penny Arcade, and watched them almost die
             | when the bottom fell out of the banner-ad market.
             | 
             | Now, most of the value I find in the web comes from niche
             | home-improvement forums (which Reddit has mostly digested).
             | But even Reddit has a problem if users stop showing up from
             | SEO.
        
             | bombela wrote:
             | > Sheldon Brown (July 14, 1944 - February 4, 2008)
        
           | bee_rider wrote:
           | I think it's basically impossible to prevent AI crawlers. It
           | is like video game cheating, at the extreme they could
            | literally point a camera at the screen and have it do image
            | processing, and talk to the computer through the USB port,
            | emulating a mouse and keyboard outside the machine. They
           | don't do that, of course, because it is much easier to do it
           | all in software, but that is the ultimate circumvention of
           | any attempt to block them out that doesn't also block out
           | humans.
           | 
            | I think the business model for "content creating" is going to
            | have to change, for better or worse (a lot of YouTube stars
           | are annoying as hell, but sure, stuff like well-written news
           | and educational articles falls under this umbrella as well,
           | so it is unfortunate that they will probably be impacted
           | too).
        
             | yojo wrote:
             | I don't subscribe to technological inevitabilism.
             | 
             | Cloudflare banning bad actors has at least made scraping
              | more expensive, and changes the economics of it: more
              | sophisticated deception is necessarily more expensive. If
              | the cost of forcing entry gets high enough, scrapers might
              | be willing to pay for access instead.
             | 
             | But I can imagine more extreme measures. e.g. old web of
             | trust style request signing[0]. I don't see any easy way
             | for scrapers to beat a functioning WOT system. We just
             | don't happen to have one of those yet.
             | 
             | 0: https://en.m.wikipedia.org/wiki/Web_of_trust
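To make the request-signing idea concrete, here is a deliberately simplified sketch. A real WoT scheme would use per-identity asymmetric keys whose endorsements chain back to parties the site trusts; this stand-in uses a symmetric HMAC just to show the shape of signing and verifying a canonicalized request (the names and secret are hypothetical):

```python
import hashlib
import hmac

# Hypothetical shared secret standing in for a WoT-endorsed identity key.
SECRET = b"per-identity-secret"

def sign_request(method: str, path: str, body: bytes,
                 secret: bytes = SECRET) -> str:
    # Canonicalize method + path + body hash so the signature covers
    # exactly what the server will see.
    message = b"\n".join([method.encode(), path.encode(),
                          hashlib.sha256(body).digest()])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_request(method: str, path: str, body: bytes,
                   signature: str, secret: bytes = SECRET) -> bool:
    # Constant-time comparison to avoid timing side channels.
    expected = sign_request(method, path, body, secret)
    return hmac.compare_digest(expected, signature)
```

A site could then refuse or rate-limit requests whose signatures don't verify against any endorsed identity. As the thread notes, the crypto is the easy part; endorsement and revocation are where such a system succeeds or fails.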
        
               | Spivak wrote:
               | It is inevitable, not because of some technological
                | predestination, but because if these services get hard-
                | blocked and unable to perform their duties, they will
                | ship the agent as a web browser or browser add-on, just
                | like all the VSCode forks, and then the requests will
                | happen locally through the same pipe as the user's
                | normal browser. It will be functionally
                | indistinguishable from
               | normal web traffic since it will be normal web traffic.
        
               | skeledrew wrote:
               | Then personal key sharing will become a thing, similar to
               | BugMeNot et al.
        
               | immibis wrote:
               | Beating web of trust is actually pretty easy: pay people
               | to trust you.
               | 
               | Yes, you can identify who got paid to sign a key and ban
               | them. They will create another key, go to someone else,
               | pretend to be someone not yet signed up for WoT (or pay
               | them), and get their new key signed, and sign more keys
               | for money.
               | 
               | So many people will agree to trust for money, and
               | accountability will be so diffuse, that you won't be able
               | to ban them all. Even you, a site operator, would accept
               | enough money from OpenAI to sign their key, for a promise
               | the key will only be used against your competitor's site.
               | 
               | It wouldn't take a lot to make a binary-or-so tree of
               | fake identities, with exponential fanout, and get some
               | people to trust random points in the tree, and use the
               | end nodes to access your site.
               | 
               | Heck, we even have a similar problem right now with IP
               | addresses, and not even with very long trust chains. You
               | are "trusted" by your ISP, who is "trusted" by one of the
               | RIRs or from another ISP. The RIRs trust each other and
               | you trust your local RIR (or probably all of them). We
               | can trace any IP to see who owns it. But is that useful,
               | or is it pointless because all actors involved make money
               | off it? You know, when we tried making IPs more
               | identifying, all that happened is VPN companies sprang up
               | to make money by leasing non-identifying IPs. And most
               | VPN exits don't show up as owned by the VPN company,
               | because they'd be too easy to identify as non-
               | identifying. They pay hosting providers to use their IPs.
               | Sometimes they even pay residential ISPs so you can't
               | even go by hosting provider. The original Internet _was_
               | a web of trust (represented by physical connectivity),
               | but that's long gone.
        
               | bee_rider wrote:
               | > Cloudflare banning bad actors has at least made
               | scraping more expensive, and changes the economics of it
               | - more sophisticated deception is necessarily more
               | expensive. If the cost is high enough to force entry,
               | _scrapers might be willing to pay for access._
               | 
               | I think this might actually point at the end state.
               | Scraping bots will eventually get good enough to emulate
               | a person well enough to be indistinguishable (are we
               | there yet?). Then, content creators will have to price
               | their content appropriately. Have a Patreon, for example,
               | where articles are priced at the price where the creator
               | is fine with having people take that content and add it
               | to the model. This is essentially similar to studios
               | pricing their content appropriately... for Netflix to buy
               | it and broadcast it to many streaming users.
               | 
               | Then they will have the problem of making sure their
               | business model is resistant to non-paying users. Netflix
               | can't stop me from pointing a camcorder at my TV while
               | playing their movies, and distributing it out like that.
               | But, somehow, that fact isn't catastrophic to their
               | business model for whatever reason, I guess.
               | 
               | Cloudflare can try to ban bad actors. I'm not sure if it
               | is Cloudflare, but as someone who usually browses without
               | JavaScript enabled I often bump into "maybe you are a
               | bot" walls. I recognize that I'm weird for not running
               | JavaScript, but eventually their filters will have the
               | problem where the net that captures bots also captures
               | normal people.
        
             | subspeakai wrote:
             | This is the fascinating place where I think this all
             | goes: at some point costs come down and you can do this
             | and bypass everything.
        
           | shadowgovt wrote:
           | > Otherwise there is literally no reason for them to make any
           | of it available on the open web
           | 
           | This is the hypothesis I always personally find fascinating
           | in light of the army of semi-anonymous Wikipedia volunteers
           | continuously gathering and curating information without pay.
           | 
           | If it became functionally impossible to upsell a little
           | information for more paid information, I'm sure _some_ people
           | would stop creating information online. I don't know if it
           | would be enough to fundamentally alter the character of the
           | web.
           | 
           | Do people (generally) put things online to get money or
           | because they want it online? And is "free" data worse quality
           | than data you have to pay somebody for (or is the challenge
           | more one of curation: when anyone can put anything up for
           | free, sorting high- and low-quality based on whatever
           | criteria becomes a new kind of challenge?).
           | 
           | Jury's out on these questions, I think.
        
             | yojo wrote:
             | Any information that requires something approximating a
             | full-time job's worth of effort to produce will necessarily
             | go away, barring the small number of independently wealthy
             | creators.
             | 
             | Existing subject-matter experts who blog for fun may or may
             | not stick around, depending on what part of it is "fun" for
             | them.
             | 
             | While some must derive satisfaction from increasing the
             | total sum of human knowledge, others are probably blogging
             | to engage with readers or build their own personal brand,
             | neither of which is served by AI scrapers.
             | 
             | Wikipedia is an interesting case. I still don't entirely
             | understand why it works, though I think it's telling that
             | 24 years later no one has replicated their success.
        
               | SoftTalker wrote:
               | Wikipedia works for the same reason open-source does:
               | because most of the contributors are experts in the
               | subject and have paid jobs in that field. Some are also
               | just enthusiasts.
        
               | ndriscoll wrote:
               | OpenStreetMap is basically Wikipedia for maps and is
               | quite successful. Over 10M registered users and millions
               | of edits per day. Lots of information is also shared
               | online on forums for free. The hosting (e.g. reddit) is
               | basically a commodity that benefits from network effects.
               | The information is the more interesting bit, and people
               | share it because they feel like it.
        
         | johnfn wrote:
         | Unless I am misunderstanding you, you are talking about
         | something different than the article. The article is talking
         | about web-crawling. You are talking about local / personal LLM
         | usage. No one has any problems with local / personal LLM usage.
         | It's when Perplexity uses web crawlers that an issue arises.
        
           | lukeschlather wrote:
           | You probably need a computer that costs $250,000 or more to
           | run the kind of LLM that Perplexity uses, but with batching
           | it costs pennies to have the same LLM fetch a page for you,
           | summarize the content, and tell you what is on it. Power
           | usage is similar: running the LLM for a single user costs
           | a huge amount relative to the per-user power it takes in a
           | cloud environment serving many users.
           | 
           | Perplexity's "web crawler" is mostly operating like this on
           | behalf of users, so they don't need a massively expensive
           | computer to run an LLM.
        
           | st3fan wrote:
           | Is the article really talking about crawling? Because in one
           | of their screenshots where they ask information about the
           | "honeypot" website you can see that the model requested pages
           | from the website. But that is most definitely "fetching by
           | proxy because I asked a question about the website" and not
           | random crawling.
           | 
           | It is confusing.
        
         | sbarre wrote:
         | All of these scenarios assume you have an unconditional right
         | to access the content on a website in whatever way you want.
         | 
         | Do you think you do?
         | 
         | Or is there a balance between the rights of the owner, who
         | bears the content production and hosting/serving costs, and
         | the rights of the end user who wishes to benefit from that
         | content?
         | 
         | If you say that you have the right, and that right should be
         | legally protected, to do whatever you want on your computer,
         | should the content owner not also have a legally protected
         | right to control how, and by whom, and in what manner, their
         | content gets accessed?
         | 
         | That's how it currently works in the physical world. It doesn't
         | work like that in the digital world due to technical
         | limitations (which is a different topic, and for the record I
         | am fine with those technical limitations as they protect other
         | more important rights).
         | 
         | And since the content owner is, by definition, the owner of the
         | content in question, it feels like their rights take
         | precedence. If you don't agree with their offering (i.e. their
         | terms of service), then as an end user you don't engage, and
         | you don't access the content.
         | 
         | It really can be that simple. It's only "difficult to solve" if
         | you don't believe a content owner's rights are as valid as your
         | own.
        
           | cutemonster wrote:
           | If there's an article you want to read, and the ToS says that
           | in between reading each paragraph, you must switch to their
           | YouTube channel and look at their ads about cat food for 5
           | minutes, are you going to do that?
        
             | JimDabell wrote:
             | Hacker News has collectively answered this question by
             | consistently voting up the archive.is links in the comments
             | of every paywalled article posted here.
        
               | giantrobot wrote:
               | News sites have collectively decided to require people
               | to use those services because they can't fathom _not_
               | enshittifying everything until it's an unusable
               | transaction hellscape.
               | 
               | I never really minded magazine ads or even television
               | ads. They might have tried to make me associate boobs
               | with a brand of soda but they didn't data mine my life
               | and track me everywhere. I'd much rather have old
               | fashioned manipulation than pervasive and dangerous
               | surveillance capitalism.
        
           | gruez wrote:
           | >Or is there a balance between the owner's rights, who bears
           | the content production and hosting/serving costs, and the
           | rights of the end user who wishes to benefit from that
           | content?
           | 
           | If you believe in this principle, fair enough, but are you
           | going to apply this consistently? If it's fair game for a
           | blog to restrict access to AI agents, what does that mean for
           | other user agents that companies disagree with, like browsers
           | with adblock? Does it just boil down to "it's okay if a
           | person does it but not okay if a big evil corporation does
           | it?"
        
           | hansvm wrote:
           | It doesn't work like that in the physical world though. Once
           | you've bought a book the author can't stipulate that you're
           | only allowed to read it with a video ad in the sidebar, by
           | drinking a can of coke before each chapter, or by giving them
           | permission to sniff through your family's medical history.
           | They can't keep you from loaning it out for other people to
           | read, even thousands of other people. They can't stop you
           | from reading it in a certain room or with your favorite music
           | playing. You can even have an LLM transcribe or summarize it
           | for you for personal use (not everyone has those automatic
           | page flipping machines, but hypothetically).
           | 
           | The reason people are up in arms is because rights they
           | previously enjoyed are being stripped away by the current
           | platforms. The content owner's rights aren't as valid as my
           | own in the current world; they trump mine 10 to 1. If I "buy"
           | a song and the content owner decides that my country is
           | politically unfriendly, they just delete it and don't refund
           | me. If I request to view their content and they start by
           | wasting my bandwidth sending me an ad I haven't consented to,
           | how can I even "not engage"? The damage is done, and there's
           | no recourse.
        
         | jasonjmcghee wrote:
         | I think it's an issue of scale.
         | 
         | The next step in your progression here might be:
         | 
         | If / when people have personal research bots that go and look
         | for answers across a number of sites, requesting many pages
         | much faster than humans do - what's the tipping point? Is
         | personal web crawling ok? What if it gets a bit smarter and
         | tries to anticipate what you'll ask and does a bunch of
         | crawling to gather information regularly to try to stay up to
         | date on things (from your machine)? Or is it when you tip the
         | scale further and do general / mass crawling for many users to
         | consume that it becomes a problem?
        
           | cj wrote:
           | Doesn't o3 sort of already do this? Whenever I ask it
           | something, it makes it look like it simultaneously opens 3-8
           | pages (something a human can't do).
           | 
           | Seems like a reasonable stance would be something like
           | "Following the no crawl directive is especially necessary
           | when navigating websites faster than humans can."
           | 
           | > What if it gets a bit smarter and tries to anticipate
           | what you'll ask and does a bunch of crawling to gather
           | regularly to try to stay up to date on things (from your
           | machine)?
           | 
           | To be fair, Google Chrome already (somewhat) does this by
           | preloading links it thinks you might click, before you click
           | it.
           | 
           | But your point is still valid. We tolerate it because as
           | website owners, we want our sites to load fast for users. But
           | if we're just serving pages to robots and the data is
           | repackaged to users without citing the original source, then
           | yea... let's rethink that.
        
             | Spivak wrote:
             | You don't middle click a bunch of links when doing
             | research? Of all the things to point to I wouldn't have
             | thought "opens a bunch of tabs" to be one of the
             | differentiating behaviors between browsing with Firefox and
             | browsing with an LLM.
        
             | fauigerzigerk wrote:
             | _> Doesn't o3 sort of already do this?_
             | 
             | ChatGPT probably uses a cache though. Theoretically, the
             | average load on the original sites could be far less than
             | users accessing them directly.
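
       A toy sketch of that caching argument (hypothetical; not any
       vendor's actual implementation): a shared fetch cache turns many
       user queries about the same page into a single origin request.

       ```python
       import time

       class CachedFetcher:
           """Serve repeat requests for a URL from a local copy within a TTL."""

           def __init__(self, fetch, ttl=300):
               self.fetch = fetch        # function: url -> content
               self.ttl = ttl            # seconds a cached copy stays fresh
               self.cache = {}           # url -> (fetched_at, content)
               self.origin_hits = 0      # requests that reached the origin

           def get(self, url):
               now = time.time()
               entry = self.cache.get(url)
               if entry and now - entry[0] < self.ttl:
                   return entry[1]       # cache hit: no origin traffic
               self.origin_hits += 1     # cache miss: one real fetch
               content = self.fetch(url)
               self.cache[url] = (now, content)
               return content

       # 100 user queries about the same page -> a single origin fetch
       fetcher = CachedFetcher(lambda url: f"<html>{url}</html>")
       for _ in range(100):
           fetcher.get("https://example.com/article")
       print(fetcher.origin_hits)  # -> 1
       ```

       Under this model the average load on the original site can indeed
       be far lower than users hitting it directly.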
        
           | fxtentacle wrote:
           | Maybe we should just institutionalize and explicitly legalize
           | the Internet Archive and Archive Team. Then, I can download a
           | complete and halfway current crawl of domain X from the IA
           | and that way, no additional costs are incurred for domain X.
           | 
           | But of course, most website publishers would hate that.
           | Because they don't want people to access their content, they
           | want people to look at the ads that pay them. That's why to
           | them, the IA crawling their website is akin to stealing.
           | Because it's taking away some of their ad impressions.
        
             | ivape wrote:
             | Or websites can monetize their data via paid apis and
             | downloadable archives. That's what makes Reddit the most
             | valuable data trove for regular users.
        
               | ccgreg wrote:
               | I don't think Reddit pays the people who voluntarily
               | write Reddit content. Valuable to Reddit, I guess.
        
             | palmfacehn wrote:
             | https://commoncrawl.org/
             | 
             | >Common Crawl maintains a free, open repository of web
             | crawl data that can be used by anyone.
        
             | stanmancan wrote:
             | I have mixed feelings on this.
             | 
             | Many websites (especially the bigger ones) are just
             | businesses. They pay people to produce content, hopefully
             | make enough ad revenue to make a profit, and repeat.
             | Anything that reproduces their content and steals their
             | views has a direct effect on their income and their ability
             | to stay in business.
             | 
             | Maybe IA should have a way for websites to register to
             | collect payment for lost views or something. I think it's
             | negligible now, there are likely no websites losing
             | _meaningful_ revenue from people using IA instead, but it
             | might be a way to get better buy in if it were
             | institutionalized.
        
               | like_any_other wrote:
               | If magazines and newspapers could once be funded by
               | native ads, websites can be too. The spying industry
               | doesn't want you to know this, but ads work without
               | spying too - just look at all the IRL billboards still
               | around.
        
           | tr_user wrote:
           | I saw someone suggest in another post that if only one
           | crawler visited and scraped, and everyone else reused that
           | copy, most websites would be OK with it. But the problem is
           | every billionaire-backed startup draining your resources
           | with something similar to a DoS attack.
        
         | Spacecosmonaut wrote:
         | Regarding point 3: The problem from the perspective of websites
         | would not be any different if they had been completely ad free.
         | People would still consume LLM generated summaries because they
         | cut down clicks and eyeballing to present you information that
         | directly pertains to the prompt.
         | 
         | The whole concept of a "website" will simply become niche. How
         | many zoomers still visit any but the most popular websites?
        
         | ai-christianson wrote:
         | Websites should be able to request payment. Who cares if it is
         | a human or an agent of a human if it is paying for the request?
        
           | carlosjobim wrote:
           | They are able to request payment.
        
           | sds357 wrote:
           | What if the agent is reselling the request?
        
           | adriand wrote:
           | Cloudflare launched a mechanism for this:
           | https://blog.cloudflare.com/introducing-pay-per-crawl/
        
         | pyrale wrote:
         | Because LLM companies have historically been extremely
         | disingenuous when it comes to crawling these sites.
         | 
         | Also because there is a difference between a user hitting f5 a
         | couple times and a crawler doing a couple hundred requests.
         | 
         | Also because ultimately, by intermediating the request, LLM
         | companies rob website owners of a business model. A newspaper
         | may be fine letting adblockers see their articles, in hopes
         | that they may eventually subscribe. When an LLM crawls the
         | info and displays it with much less visibility for the
         | source, that hope
         | may not hold.
        
         | fluidcruft wrote:
         | In theory, couldn't the LLM access the content in your browser
         | and its cache, rather than interacting with the website
         | directly? Browser automation directly related to user activity
         | (prefetch etc) seems qualitatively different to me. Similarly,
         | refusing to download content or modifying content after it's
         | already in my browser is also qualitatively different. That all
         | seems fair-use-y. I'm not sure there's a technical solution
         | beyond the typical cat/mouse wars... but there is a smell when
         | a datacenter pretends to be a person. That's not a browser.
         | 
         | It could be a personal knowledge management system, but it
         | seems like knowledge management systems should be operating off
         | of things you already have. The research library down the
         | street isn't considered a "personal knowledge management
         | system" in any sense of the term if you know what I mean. If
         | you dispatch an army of minions to take notes on the library's
         | contents that doesn't seem personal. Similarly if you dispatch
         | the army of minions to a bookstore rather than a library. At
         | the very least bring the item into your house/office first.
         | (Libraries are a little different because they are designed
         | for studying and taking notes; it's the army-of-minions
         | aspect that matters.)
        
           | freehorse wrote:
           | > couldn't the LLM access the content on your browser
           | 
           | Yes. Orbit, a now-deprecated Firefox extension by Mozilla,
           | did exactly that. You could also use it to summarise
           | content that would not be available to a third party (e.g.
           | something in Google Docs).
           | 
           | You can still sort of do the same with the AI chatbot panel
           | in Firefox: Ctrl+A > right click > AI chatbot > summarise.
        
         | beardyw wrote:
         | You speak as 1% of the population to 1% of the population.
         | Don't fool yourself.
        
         | porridgeraisin wrote:
         | I don't think people have a problem with an LLM issuing GET
         | website.com and then summarising that, each and every time it
         | uses that information (or at least, saving a citation to it
         | and referring to that citation). The ad ecosystem is an
         | exception; ignoring it for now, please see the last paragraph.
         | 
         | The problem is with the LLM then training on that data _once_
         | and then storing it forever and regurgitating it N times in the
         | future without ever crediting the original author.
         | 
         | So far, humans themselves did this, but only for relatively
         | simple information (ratio of rice and water in specific
         | $recipe). You're not gonna send a link to your friend just to
         | see the ratio, you probably remember it off the top of your
         | head.
         | 
         | Unfortunately, the top of an LLMs head is pretty big, and they
         | are fitting almost the entire website's content in there for
         | most websites.
         | 
         | The threshold beyond which content becomes irreproducible for
         | human consumers, and therefore copyrightable (a lot of
         | copyright law has a "reasonable" standard that refers to this
         | same concept), has now shifted many times higher.
         | 
         | Now, IMO:
         | 
         | So far, for stuff that won't fit in someone's head, people were
         | using citations (academia, for example). LLMs should also use
         | citations. That solves the ethical problem pretty much. That
         | the ad ecosystem chose views as the monetisation point and is
         | thus hurt by this is not anyone else's problem. The ad
         | ecosystem can innovate and adjust to the new reality in their
         | own time and with their own effort. I promise most people won't
         | be waiting. Maybe google can charge per LLM citation. Cost Per
         | Citation, you even maintain the acronym :)
        
           | skydhash wrote:
           | That's why websites have no issues with googlebot and the
           | search results. It's a giant index and citation list. But
           | stripping works from their context and presenting them as
           | your own has been decried throughout history.
        
           | wulfstan wrote:
           | Yes, this is the crux of the matter.
           | 
           | The "social contract" that has been established over the last
           | 25+ years is that site owners don't mind their site being
           | crawled reasonably provided that the indexing that results
           | from it links back to their content. So when
           | AltaVista/Yahoo/Google do it and then score and list your
           | website, interspersing that with a few ads, then it's a
           | sensible quid pro quo for everyone.
           | 
           | LLM AI outfits are abusing this social contract by stuffing
           | the crawled data into their models,
           | summarising/remixing/learning from this content, claiming
           | "fair use" and then not providing the quid pro quo back to
           | the originating data. This is quite likely terminal for many
           | content-oriented businesses, which ironically means it will
           | also be terminal for those who will ultimately depend on
           | additions, changes and corrections to that content - LLM AI
           | outfits.
           | 
           | IMO: copyright law needs an update to mandate no training on
           | content without explicit permission from the holder of the
           | copyright of that content. And perhaps, as others have
           | pointed out, an llms.txt to augment robots.txt that covers
           | this for llm digestion purposes.
           | 
           | EDIT: Apparently llms.txt has been suggested, but from what I
           | can tell this isn't about restricting access:
           | https://llmstxt.org/
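
       For context, the existing opt-out mechanism looks like this: a
       robots.txt addressing the user-agent tokens the major AI vendors
       have publicly documented (GPTBot, PerplexityBot, and CCBot are
       real declared agents; the article's complaint is precisely that
       undeclared crawlers ignore such rules):

       ```
       # Block declared AI crawlers site-wide
       User-agent: GPTBot
       Disallow: /

       User-agent: PerplexityBot
       Disallow: /

       User-agent: CCBot
       Disallow: /

       # Everything else, e.g. traditional search, remains allowed
       User-agent: *
       Allow: /
       ```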
        
             | giantrobot wrote:
             | > LLM AI outfits are abusing this social contract by
             | stuffing the crawled data into their models,
             | summarising/remixing/learning from this content
             | 
             | Let's be real, Google et al have been doing this for
             | years with their quick answer and info boxes. AI chatbots
             | are _worse_ but it's not like the big search engines were
             | great before AI came along. Google had made itself the
             | one-stop shop for a huge percentage of users. They paid
             | billions to be the default search engine on Apple's
             | platforms not out of the goodness of their hearts but to
             | be the main destination for everyone on the web.
        
           | nelblu wrote:
           | > LLMs should also use citations.
           | 
           | Mojeek LLM (https://www.mojeek.com) uses citations.
        
         | itsdesmond wrote:
         | Some stores do not welcome Instacart or Postmates shoppers. You
         | can shop there. You can shop with your phone out, scanning
         | every item to price match, something that some bookstores frown
         | on, for example. Third party services cannot send employees to
         | index their inventory, nor can they be dispatched to pick up an
         | item you order online.
         | 
         | Their reasons vary. Some don't want their business's perceived
         | quality to be taken out of their control (delivering cold
         | food, marking up items, poor substitutions). Some would prefer
         | their staff service and build relationships with customers
         | directly, instead of disinterested and frequently quite
         | demanding runners. Some just straight up disagree with the
         | practice of third party delivery.
         | 
         | I think that it's pretty unambiguously reasonable to choose to
         | not allow an unrelated business to operate inside of your
         | physical storefront. I also think that maps onto digital
         | services.
        
           | rjbwork wrote:
           | But I can send my personal shopper and you'll be none the
           | wiser.
        
             | bradleyjg wrote:
             | It's possible to violate all sorts of social norms.
             | Societies that celebrate people that do so are on the far
             | opposite end of the spectrum from high trust ones. They are
             | rather unpleasant.
        
               | ToucanLoucan wrote:
               | Just the Silicon Valley ethos extended to its logical
               | conclusions. These companies take advantage of public
               | space, utilities and goodwill at industrial scale to
               | "move fast and break things" and then everyone else has
               | to deal with the ensuing consequences. Like how cities
               | are awash in those fucking electric scooters now.
               | 
               | Mind you I'm not saying electric scooters are a bad idea,
               | I have one and I quite enjoy it. I'm saying we didn't
               | need five fucking startups all competing to provide them
               | at the lowest cost possible just for 2/3s of them to end
               | up in fucking landfills when the VC funding ran out.
        
               | SoftTalker wrote:
               | My city impounded them and made them pay a fee to get
               | them back. Now they have to pay a fee every year to be
               | able to operate. Win/win.
        
               | pixl97 wrote:
               | Oh, this is a bunch of baloney.
               | 
               | What you've pretty much stated is "You must go to the
               | shops yourself so the ads and marketing can completely
               | permeate your soul, and turn you into a voracious
               | consumer."
               | 
               | Businesses have the right to fuck completely and totally
               | off a cliff taking their investor class with them in to
               | the pit of the void. They leer at us from high places
               | spending countless dollars on new ways to tell us we
               | aren't good enough.
        
               | Workaccount2 wrote:
               | The only thing consumers have to do to get rid of ads
               | permeating everything is pay for services in full
               | directly. But they won't do that, because the only thing
               | they hate more than ads, is paying with money instead.
        
             | itsdesmond wrote:
             | [flagged]
        
               | dang wrote:
               | Whoa, please don't post like this. We end up banning
               | accounts that do.
               | 
               | https://news.ycombinator.com/newsguidelines.html
        
               | itsdesmond wrote:
               | Aw, alright. I thought it was a funny way to make the
               | point and I figured the yo momma structure was
               | traditional enough to not be taken as a proper insult.
               | Heard tho.
        
             | rapind wrote:
             | It's all about scale. The impact of your personal shopper
             | is insignificant unless you manage to scale it up into a
             | business where everyone has a personal shopper by default.
        
               | mbrumlow wrote:
               | Well then. Seems like you would be a fool to not allow
               | personal shoppers then.
               | 
               | The point is the web is changing, and people use a
               | different type of browser now. And that browser happens
               | to be LLMs.
               | 
               | Anybody complaining about the new browser has just not
               | got it yet, or has and is trying to keep things the old
               | way because they don't know how or won't change with the
               | times. We have seen it before, Kodak, blockbuster,
               | whatever.
               | 
               | Grow up, Cloudflare: some of your business models don't
               | make sense any more.
        
               | ToucanLoucan wrote:
               | > Anybody complaining about the new browser has just not
               | got it yet, or has and is trying to keep things the old
               | way because they don't know how or won't change with the
               | times. We have seen it before, Kodak, blockbuster,
               | whatever.
               | 
               | You say this as though all LLM or otherwise automated
               | traffic fulfills a request made by a user 100% of the
               | time, which is flatly untrue on its face.
               | 
               | Companies make vast amounts of requests for indexing
               | purposes. That _could_ be to facilitate user requests
               | _someday,_ perhaps, but it is not today and not why it's
               | happening. And worse still, LLMs introduce a new third
               | option: that it's not for indexing or for later linking
               | but is instead either for training the language model
               | itself, or for the model to ingest and regurgitate later
               | on with no attribution, with the added fun that it might
               | just make some shit up about whatever you said and be
               | wrong. And as the person buying the web hosting, _all of
               | that is subsidized by me._
               | 
               | "The web is changing" does not mean every website must
               | follow suit. Since I built my blog about 2 internet
               | eternities ago, I have seen fad tech come and fad tech
               | go. My blog remains more or less exactly what it was 2
               | decades ago, with more content and a better stylesheet. I
               | have requested in my robots.txt that my content not be
               | used for LLM training, and I fully expect that to be
               | ignored because tech bros don't respect anyone, even
               | fellow tech bros, when it means they have to change their
               | behavior.
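The robots.txt opt-out described above is typically a few stanzas naming the crawlers' declared user agents. GPTBot, CCBot, and PerplexityBot are real AI-crawler agents, but any such list is necessarily incomplete, and honoring it is entirely voluntary:

```text
# robots.txt -- ask declared AI crawlers not to fetch anything
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```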
        
               | Imustaskforhelp wrote:
               | Tech bros just respect money. Making money is very easy
               | in the short term if you don't show ethics. Venture
               | capitalism and the whole growth/indie-hacking scene is
               | focused on making money and making it fast.
               | 
               | It's a clear road to disaster. By comparison, I am
               | honestly surprised by how great Hacker News is, where
               | most people are sharing for the love of the craft. For
               | that, Hacker News holds a special place in my heart.
               | (Slightly exaggerating to give it a thematic ending, I
               | suppose.)
        
               | julkali wrote:
               | Do not conflate your own experience with everyone else's.
        
               | goatlover wrote:
               | Some people use LLMs to search. Other people still prefer
               | going to the actual websites. I'm not going to use an LLM
               | to give me a list of the latest HN posts or NY Times
               | articles, for example.
        
               | nickthegreek wrote:
               | How is everyone having a personal shopper a problem of
               | scale? I was going to shop myself, but I sent someone
               | else to do it for me.
               | 
               | At this moment I am using Perplexity's Comet browser to
               | take a spotify playlist and add all the tracks to my
               | youtube music playlist. I love it.
        
               | rapind wrote:
               | I didn't use the word "problem". In fact I presented no
               | opinion at all. I'm just pointing out that scale matters
               | a lot. In fact, in tech, it's often the only thing that
               | matters. It's naive (or narrative) to think it doesn't.
               | 
               | Everyone having a personal shopper obviously changes the
               | relationship to the products and services you use or
               | purchase via personal shopper. Good, bad, whatever.
        
               | SoftTalker wrote:
               | We'll see more of this sort of thing as AI agents become
               | more popular and capable. They will do things that the
               | site or app should be able to do (or rather, things that
               | _users want to be able to do_) but don't offer. The
               | YouTube music playlist is a good example. One thing I'd
               | like to be able to do is make a playlist of some specific
               | artists. But you can't. You have to select specific
               | songs.
               | 
               | If sites want to avoid people using agents, they should
               | offer the functionality that people are using the agents
               | to accomplish.
        
               | dylan604 wrote:
               | Let's look at the opposite benefit to a store if a mom
               | that would need to bring her 3 kids to the store vs that
               | mom having a personal shopper. In this case, the personal
               | shopper is "better" for the store as far as physical
               | space. However, I'm sure the store would still rather
               | have the mom and 3 kids physically in the store so that
               | the kids can nag mom into buying unneeded items that are
               | placed specifically to attract those kids' attention.
        
               | pixl97 wrote:
               | >so that the kids can nag mom into buying unneeded items
               | 
               | Excellent. Personal shoppers are 'adblock for IRL'.
               | 
               | >You owe the companies nothing. You especially don't owe
               | them any courtesy. They have re-arranged the world to put
               | themselves in front of you. They never asked for your
               | permission, don't even start asking for theirs.
        
             | 542354234235 wrote:
             | True, and I would ask, what is your point? Is it that no
             | rule can have 100% perfect enforcement? That all rules have
             | a grey area if you look close enough? Was it just a
             | "gotcha" statement meant to insinuate what the prior
             | commenter said was invalid?
        
             | Polizeiposaune wrote:
             | To stretch the analogy to the breaking point: If you send
             | 10,000 personal shoppers all at once to the same store just
             | to check prices, the store's going to be rightfully annoyed
             | that they aren't making sales because legit buyers can't
             | get in.
        
               | sublinear wrote:
               | Too bad. Build a bigger store or publish this information
               | so we don't need 10,000 personal shoppers. Was this not
               | the whole point of having a website? Who distorted that
               | simple idea into the garbage websites we have now?
        
               | recursive wrote:
               | Weird take. The store doesn't owe your personal shoppers
               | anything.
        
               | the_real_cher wrote:
               | By the same token, the personal shoppers don't owe the
               | store anything either.
        
               | eddythompson80 wrote:
               | Surely they owe them money for the goods and service, no?
               | I thought that's how stores worked.
        
               | the_real_cher wrote:
               | Context, friend. This article and the entire comments
               | section are about questionable web page access. Context.
        
               | eddythompson80 wrote:
               | You're replying in a store metaphor thread though.
               | Context matters.
        
               | recursive wrote:
               | Then they can't complain if they're barred entry.
        
               | the_real_cher wrote:
               | HTTP is neutral. It's up to the client whether to ignore
               | robots.txt.
               | 
               | You can block IP's at the host level but there's pretty
               | easy ways around that with proxy networks.
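For what it's worth, a client that chooses to honor robots.txt can do so with a few lines of stdlib Python. A minimal sketch using `urllib.robotparser`, with a hypothetical `ExampleBot` user agent:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check whether robots.txt rules permit `user_agent` to fetch `path`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

rules = """\
User-agent: ExampleBot
Disallow: /private/
"""

print(allowed(rules, "ExampleBot", "/private/page"))  # False
print(allowed(rules, "ExampleBot", "/public/page"))   # True
```

Nothing in HTTP enforces these rules, which is the commenter's point: compliance lives entirely in the client.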
        
               | eddythompson80 wrote:
               | > http is neutral.
               | 
               | Who misled you with that statement?
        
               | the_real_cher wrote:
               | HTTP doesn't have emotions or thoughts, last time I
               | checked.
        
               | eddythompson80 wrote:
               | It seems that a 403 makes you sad though.
        
               | the_real_cher wrote:
               | iproyal.com makes me smile again
        
               | drdaeman wrote:
               | IETF?
        
               | drdaeman wrote:
               | That's fair, but if there's enough supply and demand for
               | this to get traction (and online shopping is big, and
               | autonomous agents are trending), this conflict of
               | interest, paired with a no-compromise "we don't owe you
               | anything" attitude, is bound to escalate into an arms
               | race. And YMMV, but I don't like where that race may
               | end.
               | 
               | If store businesses rely, at least partially, on
               | obscuring information that can be recovered through
               | automated means (e.g. storefronts push visitors towards
               | products they don't want, and buyer agents fight that by
               | looking for what buyers actually asked for), then
               | playing this cat-and-mouse game of blocking agents,
               | finding workarounds, and repeating the cycle only
               | creates perverse technological contraptions that neither
               | party really wants - but both are circumstantially
               | forced to invest in.
        
               | dabockster wrote:
               | > Who distorted that simple idea into the garbage
               | websites we have now?
               | 
               | Corporate America. Where clean code goes to die.
        
               | hombre_fatal wrote:
               | Your comment and the above comment of course show
               | different cases.
               | 
               | An agent making a request on the explicit behalf of
               | someone else is probably something most of us agree is
               | reasonable. "What are the current stories on Hacker
               | News?" -- the agent is just doing the same request to the
               | same website that I would have done anyways.
               | 
               | But the sort of non-explicit just-in-case crawling that
               | Perplexity might do for a general question where it
               | crawls 4-6 sources isn't as easy to defend. "Are polar
               | bears always white?" -- Now it's making requests I
               | wouldn't necessarily have made, and it could even be
               | seen as a sort of amplification attack.
               | 
               | That said, TFA's example is where they register
               | secretexample.com and then ask Perplexity "what is
               | secretexample.com about?" and Perplexity sends a request
               | to answer the question, so that's an example of the first
               | case, not the second.
        
               | bayindirh wrote:
               | As a person who has a couple of sites out there, and
               | witnesses AI crawlers coming and fetching pages from
               | these sites, I have a question:
               | 
               | What prevents these companies from keeping a copy of that
               | particular page, which I specifically disallowed for bot
               | scraping, and feed it to their next training cycle?
               | 
               | Pinky promises? Ethics? Laws? Technical limitations?
               | Leeroy Jenkins?
        
               | tempfile wrote:
               | The way to prevent people from downloading your pages
               | and using them is to take them off the public internet.
               | There are laws against violating your copyright and
               | against denying access to your service (by excessive
               | traffic). But there is (thankfully) no magical right
               | that stops people from reading your content and
               | describing it.
        
               | bayindirh wrote:
               | Many site operators want people to access their content,
               | but prevent AI companies from scraping their sites for
               | training data. People who think like that made tools like
               | Anubis, and it works.
               | 
               | I also want to keep this distinction on the sites I own.
               | I also use licenses to signal that this site is not good
               | to use for AI training, because it's CC BY-NC-SA-2.0.
               | 
               | So, I license my content appropriately (non-commercial,
               | share-alike under the same license, with attribution),
               | add technical countermeasures on top because companies
               | don't _respect_ these licenses (because monies) and
               | circumvent these mechanisms (because monies), and I'm
               | the one who has to suck it up and shut up (because
               | _their_ monies)?
               | 
               | Makes no sense whatsoever.
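Anubis, mentioned above, works by making the visitor's browser solve a small proof-of-work puzzle before the page is served: cheap for one human, expensive at crawler scale. A minimal sketch of the idea in Python (the shape of the puzzle only, not Anubis's actual protocol; `solve`/`verify` are illustrative names):

```python
import hashlib
from itertools import count

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) starts with
    `difficulty` hex zeroes -- the client-side cost."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server-side check: a single hash, however hard solving was."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("example-challenge", 3)
print(verify("example-challenge", nonce, 3))  # True
```

Raising `difficulty` by one multiplies the expected solving cost by 16 while verification stays a single hash, which is what makes the asymmetry useful against bulk scrapers.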
        
               | hombre_fatal wrote:
               | I guess that's a question that might be answered by the
               | NYT vs OpenAI lawsuit at least on the enforceability of
               | copyright claims if you're a corporation like NYT.
               | 
               | If you don't have the funds to sue an AI corp, I'd
               | probably think of a plan B. Maybe poison the data for
               | unauthenticated users. Or embrace the inevitability. Or
               | see the bright side of getting embedded in models as if
               | you're leaving your mark.
        
               | tempfile wrote:
               | Of course some people want that. And at the moment they
               | can prevent it. But those methods may stop working. Will
               | it then be alright to do it? Of course not - so why
               | bother mentioning that they can prevent it now? Just
               | give a justification.
               | 
               | Your license is probably not relevant. I can go to the
               | cinema and watch a movie, then come on this website and
               | describe the whole plot. That isn't copyright
               | infringement. Even if I told it to the whole world, it
               | wouldn't be copyright infringement. Probably the movie
               | seller would prefer it if I didn't tell anyone. Why
               | should I care?
               | 
               | I actually agree that AI companies are generally bad and
               | should be stopped - because they use an exorbitant amount
               | of bandwidth and harm the services for other users. At
               | least they should be heavily taxed. I don't even begrudge
               | people for using Anubis, at least in some cases. But it
               | is wrong-headed (and actually wrong in fact) to try to
               | say someone may or may not use my content for some
               | purpose because it hurts my feelings or it messes with my
               | ad revenue. We have laws against copyright infringement,
               | and to prevent service disruption. We should not have
               | laws that say, yes you can read my site but no you can't
               | use it to train an LLM, or to build a search index. That
               | would be unethical. Call for a windfall tax if they piss
               | you off so much.
        
               | accrual wrote:
               | Thanks for sharing your experience. A little off-topic
               | but I'd like to start hosting some personal content,
               | guides/tutorials, etc.
               | 
               | Do you still see authentic human traffic on your
               | domains? Is it easy to discern?
               | 
               | I feel like I missed the bus on running a blog pre-AI.
        
             | ghurtado wrote:
             | Sure. There's lots of things you _could_ do, but you don't
             | do them because they are wrong.
             | 
             | Might does not make right.
        
               | rjbwork wrote:
               | How is it wrong to send my personal shopper? How is it
               | wrong to have an agent act directly on my behalf?
               | 
               | It's like saying a web browser that is customized in any
               | way is wrong. If one configures their browser to eagerly
               | load links so that their next click is instant, is that
               | now wrong?
        
               | ghurtado wrote:
               | Here's a good rule of thumb: if you have to do it without
               | other people knowing, because otherwise they wouldn't let
               | you do it: chances are it's a bad thing to do.
        
             | fireflash38 wrote:
             | And you can be trespassed and prosecuted if you continue to
             | violate.
        
           | cma wrote:
           | These are more like a store putting up a billboard or
           | catalog and asking people to turn off their Meta AI glasses
           | nearby, because the store doesn't want AI translating it on
           | your behalf as a tourist.
        
             | itsdesmond wrote:
             | It is not, because the store does not expend any resources
             | on the singular instance of the glasses capturing the
             | content of the billboard. Web requests cost money.
        
         | GardenLetter27 wrote:
         | And isn't the obvious solution to just make some sort of
         | browsers add-on for the LLM summary so the request comes from
         | your browser and then gets sent to the LLM?
         | 
         | I think the main concern here is the huge amount of traffic
         | from crawling just for content for pre-training.
        
           | otterley wrote:
           | Why would a personal browser have to crawl fewer pages than
           | the agent's mechanism? If anything, the agent would be more
           | efficient because it could cache the content for others to
           | use. In the situation we're talking about, the AI engine is
           | behaving essentially like a caching proxy--just like a CDN.
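The caching-proxy analogy can be made concrete with a small sketch. `CachingFetcher` and its injected `fetch` callable are illustrative names, not any real API: the first request for a URL hits the origin, and repeats within the TTL are served from cache, so many users' queries cost the origin one fetch.

```python
import time
from typing import Callable, Dict, Tuple

class CachingFetcher:
    """Sketch of an AI engine behaving like a caching proxy/CDN.
    `fetch` stands in for a real HTTP client so the example is
    self-contained."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 60.0):
        self.fetch = fetch
        self.ttl = ttl
        self.cache: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0  # how many times we bothered the origin

    def get(self, url: str) -> str:
        now = time.monotonic()
        entry = self.cache.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # cache hit: origin never sees this
        self.origin_hits += 1        # cache miss: one origin request
        body = self.fetch(url)
        self.cache[url] = (now, body)
        return body

proxy = CachingFetcher(lambda url: f"<html>content of {url}</html>")
proxy.get("https://example.com/")
proxy.get("https://example.com/")
print(proxy.origin_hits)  # 1
```

Whether that amortization helps or harms the origin is exactly what the thread is arguing about: it cuts request volume but also removes the visit.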
        
         | shadowgovt wrote:
         | Not only is it difficult to solve, it's the next step in the
         | process of harvesting content to train AIs: companies will pay
         | humans (probably in some flavor of "company scrip," such as
         | extra queries on their AI engine) to install a browser
         | extension that will piggy-back on their human access to sites
         | and scrape the data from their human-controlled client.
         | 
         | At the limit, this problem is the problem of "keeping secrets
         | while not keeping secrets" and is unsolvable. If you've shared
         | your site content to _one_ entity you cannot control, you
         | cannot control where your site content goes from there
         | (technologically; the law is a different question).
        
           | quectophoton wrote:
           | > companies will pay humans (probably in some flavor of
           | "company scrip," such as extra queries on their AI engine) to
           | install a browser extension that will piggy-back on their
           | human access to sites and scrape the data from their human-
           | controlled client.
           | 
           | Proprietary web browsers are in a really good position to do
           | something like this, especially if they offer a free VPN. The
           | browser would connect to the "VPN servers", but it would be
           | just to signal that this browser instance has an internet
           | connection, while the requests are just proxied through
           | another browser user.
           | 
           | That way the company that owns this browser gets a free
           | network of residential IP address ready to make requests (in
           | background) using a real web browser instance. If one of
           | those background requests requires a CAPTCHA, they can just
           | show it to the real user, e.g. the real user visits a Google
           | page and they see a Cloudflare CAPTCHA, but that CAPTCHA is
           | actually from one of the background requests (while lying in
           | its UI and still showing the user a Google URL in the address
           | bar).
        
         | danieldk wrote:
         | There are also a gazillion pages that are not ad-riddled
         | content. With search engines, the implicit contract was that
         | they could crawl pages because they would drive traffic to the
         | websites that are crawled.
         | 
         | AI crawlers for non-open models void the implicit contract.
         | First they crawl the data to build a model that can do QA.
         | Proprietary LLM companies earn billions with knowledge that was
         | crawled from websites and websites don't get anything in
         | return. Fetching for user requests (to feed to an LLM) is kind
         | of similar - the LLM provider makes a large profit and the
         | author that actually put in time to create the content does not
         | even get a visit anymore.
         | 
         | Besides that, if Perplexity is fine with evading robots.txt
         | and blocks for user requests, how can one expect them not to
         | use the fetched pages to train/finetune LLMs (as a side
         | channel when people block crawling for training)?
        
         | Tuna-Fish wrote:
         | I would not mind 3, so long as it's just the LLM processing the
         | website inside its context window, and no information from the
         | website ends up in the weights of the model.
        
         | talos_ wrote:
         | This analogy doesn't map to the actual problem here.
         | 
         | Perplexity is not visiting a website every time a user asks
         | about it. It's frequently crawling and indexing the web, thus
         | redirecting traffic away from websites.
         | 
         | This crawling reduces costs and improves latency for Perplexity
         | and its users. But it's a major threat to crawled websites.
        
           | shadowgovt wrote:
           | I have never created a website that I would not mind being
           | fully crawled and indexed into another dataset that was
           | divorced from the source (other than such divorcement makes
           | it much harder to check pedigree, which is an academic
           | concern, not a data-content concern: if people want to trust
           | information from sources they can't know and they can't
           | verify I can't fix that for them).
           | 
           | In fact, the "old web" people sometimes pine for was _mostly_
           | a place where people were putting things online so they were
           | online, not because it would translate directly to money.
           | 
           | Perhaps AI crawlers are a harbinger for the death of the web
           | 2.0 pay-for-info model... And perhaps that's okay.
        
             | short_sells_poo wrote:
             | There's an important distinction that we are glossing over
             | I think. In the times of the "old web", people were putting
             | things online to interact with a (large) online audience.
             | If people found your content interesting, they'd keep
             | coming back and some of them would email you, there'd be
             | discussions on forums, IRC chatrooms, mailing lists, etc.
             | Communities were built around interesting topics, and
             | websites that started out as just some personal blog that
             | someone used to write down their thoughts would grow into
             | fonts of information for a large number of people.
             | 
             | Then came the social networks and walled gardens, SEO, and
             | all the other cancers of the last 20 years, and all of
             | these communities disappeared, replaced by un-searchable
             | videos, content farms, and Discord communities, which are
             | basically informational black holes.
             | 
             | And now AI is eating that cancer, but IMO it's just one
             | cancer being replaced by an even more insidious cancer. If
             | all the information is accessed via AI, then the last
             | semblance of interaction between content creators and
             | content consumers disappears. There are no more
             | communities, just disconnected consumers interacting with a
             | massive aggregating AI.
             | 
             | Instead of discussing an interesting topic with a human, we
             | will discuss with AI...
        
         | troyvit wrote:
         | > If I now go one step further and use an LLM to summarize
         | content because the authentic presentation is so riddled with
         | ads, JavaScript, and pop-ups, that the content becomes
         | borderline unusable, then why would the LLM accessing the
         | website on my behalf be in a different legal category as my
         | Firefox web browser accessing the website on my behalf?
         | 
         | I think one thing to ask outside of this question is how long
         | before your LLM summaries also include ads and other
         | manipulative patterns.
        
         | Neil44 wrote:
         | Flip it around, why would you go to the trouble of creating a
         | web page and content for it, if some AI bot is going to scrape
         | it and save people the trouble of visiting your site? The value
         | of your work has been captured by some AI company (by somewhat
         | nefarious means too).
        
         | carlosjobim wrote:
         | Legal category?
        
         | renewiltord wrote:
         | The websites don't nag you, actually. They just send you data.
         | You have configured your user agent to nag yourself when the
         | website sends you data.
         | 
         | And you're right: there's no difference. The web is just
         | machines sending each other data. That's why it's so funny that
         | people panic about "privacy violations" and server operators
         | "spying on you".
         | 
         | We're just sending data around. Don't send the data you don't
         | want to send. If you literally send the data to another machine
         | it might save it. If you don't, it can't. The data the website
         | operator sends you might change as a result but it's just data.
         | And a free interaction between machines.
        
         | baxuz wrote:
         | 1. To access a website you need a limited anonymized token that
         | proves you are a human being, issued by a state authority
         | 
         | 2. the end
         | 
         | I am firmly convinced that this should be the future in the
         | next decade, since the internet as we know it has been
         | weaponized and ruined by social media, bots, state actors and
         | now AI.
         | 
         | There should exist an internet for humans only, with a single
         | account per domain.
        
           | glenstein wrote:
           | A fascinating variation on this same issue can be found in
           | Neal Stephenson's "Fall, or Dodge in Hell". There the
           | solution is (1) discredit weaponized social media in its
           | entirety by amplifying its output exponentially and making
           | its hostility universal in all directions, to the point that
           | it's recognizable as bad-faith caricature. That way it can't
           | be strategically leveraged by bad actors with
           | disproportionate directional focus against strategic
           | targets; and (2) a new standard called PURDA, a kind of
           | behavioral signature serving as the mark of unique identity.
        
         | dawnerd wrote:
         | Nothing wrong if they fetch on your behalf. The problem is when
         | they endlessly crawl along with every other ai company doing
         | the same.
        
         | epolanski wrote:
         | It's somebody else's content and resources, and they are free
         | to ban you or your bots as much as they please.
        
         | EGreg wrote:
         | 1. I actually disagree. I think teasers should be free but
         | websites should charge micropayments for their content. Here is
         | how it can be done seamlessly, without individuals making
         | decisions to pay every minute: https://qbix.com/ecosystem
         | 
         | 2. This also intersects with copyright law. Ingesting content
         | to your servers en masse through automation and transforming it
         | there is not the same as giving people a tool (like Safari
         | Reader) they can run on their client for specific sites they
         | visit. Examples of companies that lost court cases about this:
         | Aereo, Inc. v. American Broadcasting Companies (2014)
         | TVEyes, Inc. v. Fox News Network, LLC (2018)       UMG
         | Recordings, Inc. v. MP3.com, Inc. (2000)       Capitol Records,
         | LLC v. ReDigi Inc. (2018)       Cartoon Network v. CSC Holdings
         | (Cablevision) (2008)       Image Search Engines: Perfect 10 v.
         | Google (2007)
         | 
         | That last one is very instructive. Caching thumbnails and
         | previews may be OK. The rest is not. AMP is in a copyright grey
         | area, because publishers _choose_ to make their content
         | available for AMP companies to redisplay. (@tptacek may have
         | more on this)
         | 
         | 3. Putting copyright law aside, that's the point.
         | Decentralization vs Centralization. If a bunch of people want
         | to come eat at an all-you-can-eat buffet, they can, because we
         | know they have limited appetites. If you bring a giant truck
         | and load up all the food from every all-you-can-eat buffet in
         | the city, that's not OK, even if you later give the food away
         | to homeless people for free. You're going to bankrupt the
         | restaurants! https://xkcd.com/1499/
         | 
         | So no. The difference is that people have come to expect "free"
         | for everything, and this is how we got into ad-supported
         | platforms that dominate our lives.
        
           | glenstein wrote:
           | I would love micropayments as a kind of baked-in ecosystem
           | support. You can crawl if you want, but it's pay to play.
           | Which hopefully drives motivation for robust norms for
           | content access and content scraping that makes everyone
           | happy.
        
             | EGreg wrote:
             | I want to bring Ted Nelson on my channel and interview him
             | about Xanadu. Does anyone here know him?
             | 
             | https://xanadu.com.au/ted/XU/XuPageKeio.html
        
           | jacurtis wrote:
           | I think this is the world we are going to. I'm not going to
           | get mired in the details of how it would happen, but I see
           | this end result as inevitable (and we are already moving that
           | way).
           | 
           | I expect a lot more paywalls for valuable content. General
           | information is commoditized and offered in aggregated form
           | through models. But when an AI is fetching information for
           | you from a website, the publisher is still paying the cost of
           | producing that content and hosting that content. The AI
           | models are increasing the cost of hosting the content and
           | then they are also removing the value of producing the
           | content since you are just essentially offering value to the
           | AI model. The user never sees your site.
           | 
           | I know Ads are unpopular here, but the truth is that is how
           | publishers were compensated for your attention. When an AI
           | model views the information that a publisher produces, then
           | modifies it from its published form, and removes all ad
           | content, you now have increased costs for producers,
           | reduced compensation for producing content (since they are not
           | getting ad traffic), and the content isn't even delivered in
           | the original form.
           | 
           | The end result is that publishers now have to paywall their
           | content.
           | 
           | Maybe an interesting middle-ground is if the AI Model
           | companies compensated for content that they access similar to
           | how Spotify compensates for plays of music. So if an AI model
           | uses information from your site, they pay that publisher a
           | fraction of a cent. People pay the AI models, and the AI
           | models distribute that to the producers of content that feed
           | and add value to the models.
        
         | gentle wrote:
         | I believe you're being disingenuous. Perplexity is running a
         | set of crawlers that do not respect robots.txt and take steps
         | to actively evade detection.
         | 
         | They are running a service and this is not a user taking steps
         | to modify their own content for their own use.
         | 
         | Perplexity is not acting as a user proxy and they need to learn
         | to stick to the rules, even when it interferes with their
         | business model.
        
         | axus wrote:
         | For 1, 2, and 3, the website owner can choose to block you
         | completely based on IP address or your User Agent. It's not
         | nice, but the best reaction would be to find another website.
         | 
         | Perplexity is choosing to come back "on a VPN" with new IP
         | addresses to evade the block.
         | 
         | #2 and #3 are about modifying data where access has been
         | granted, I think Cloudflare is really complaining about #1.
         | 
         | Evading an IP address ban doesn't violate my principles in some
         | cases, and does in others.
        
         | amiga386 wrote:
         | If you as a human are well behaved, that is absolutely fine.
         | 
         | If you as a human spam the shit out of my website and waste my
         | resources, I will block you.
         | 
         | If you as a human use an agent (or browser or extension or
         | external program) that modifies network requests on your
         | behalf, but doesn't act as a massive leech, you're still
         | welcome.
         | 
         | If you as a human use an agent (or browser or extension or
         | external program) that wrecks my website, I will block you _and
         | the agent you rode in on._
         | 
         | Nobody would mind if you had an LLM that intelligently _knew_
         | what pages contain what (because it had a web crawler backed
         | index that refreshes at a respectful rate, and identifies
         | itself accurately as a robot and follows robots.txt), and even
         | if it needed to make an instantaneous request for you at the
         | time of a pertinent query, it still identified itself as a bot
         | and was still respectful... there would be no problem.
         | 
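
The "respectful bot" described above can be sketched with nothing but the Python standard library; the robots.txt rules, URLs, and bot name below are hypothetical illustrations, not anything from the article:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; in a real bot this would
# be fetched from the site (and re-fetched at a respectful interval).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved bot asks before every fetch and honors the answer.
print(rp.can_fetch("ExampleBot", "https://example.com/articles/1"))  # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))   # False
```

Identifying accurately as a robot then just means sending a truthful User-Agent header on the requests the parser allows.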
         | The problem is that LLMs are run by stupid, greedy, evil people
         | who don't give the slightest shit what resources they use up on
         | the hosts they're sucking data from. They don't care what the
         | URLs are, what the site owner wants to keep you away from. They
         | download massive static files hundreds or thousands of times a
         | day, not even doing a HEAD to see that the file hasn't changed
         | in 12 years. They straight up ignore robots.txt and in fact use
         | it as a template of what to go for first. It's like hearing an
         | old man say "I need time to stand up because of this problem
         | with my kneecaps" and thinking "right, I best go for his
         | kneecaps because he's weak there"
         | 
         | There are plenty of open crawler datasets, they _should_ be
         | using those... but they don't, they think that doesn't
         | differentiate them enough from others using "fresher" data, so
         | they crawl even the smallest sites dozens of times a day in
         | case those small sites got updated. Their badly written
         | software is wrecking sites, and they don't care about the
         | wreckage. Not their problem.
         | 
         | The people who run these agents, LLMs, whatever, have broken
         | every rule of decency in crawling, and they're now deliberately
         | _evading_ checks, to try and run away from the repercussions of
         | their actions. They are bad actors and need to be stopped.
         | It's like the fuckwads who scorch the planet mining bitcoin;
         | there's so much money flowing in the market for AI, that they
         | feel they _have_ to fuck over everyone else, as soon as
         | possible, otherwise they won't get that big flow of money.
         | They have zero ethics. They have to be stopped before _their
         | human behaviour_ destroys the entire internet.
        
         | paulcole wrote:
         | > 1. If I as a human request a website, then I should be shown
         | the content. Everyone agrees.
         | 
         | Definitely don't agree. I don't think you should be shown the
         | content, if for example:
         | 
         | 1. You're in a country the site owner doesn't want to do
         | business in.
         | 
         | 2. You've installed an ad blocker or other tool that the site
         | owner doesn't want you to use.
         | 
         | 3. The site owner has otherwise identified you as someone they
         | don't want visiting their site.
         | 
         | You are welcome to try to fool them into giving you the content
         | but it's not your right to get it.
        
         | Vegenoid wrote:
         | There is a significant distinction between 2 and 3 that you
         | glossed over. In 1 and 2, you the human may be forced to prove
         | that you are human via a captcha. You are present at the time
         | of the request. Once you've performed the exchange, then the
         | HTML is on your computer and so you can do what you want to it.
         | 
         | In 3, although you do not specify, I assume you mean that a bot
         | requests the page, as opposed to you visiting the page like in
         | scenario 2 and then an LLM processes the downloaded data
         | (similarly to an adblocker). It is the former case that is a
         | problem; the latter case is much harder to stop, and there is
         | much less reason to stop it.
         | 
         | This is the distinction: whether a human is present at the
         | time of the request.
        
           | philistine wrote:
           | To me it's even simpler: 3 is a request made from another ip
           | address that isn't directly yours. Why should an LLM request
           | that acts exactly like a VPN request be treated differently
           | from a VPN request?
        
             | freehorse wrote:
             | Yeah, I also find the analogy about "agent on behalf of the
             | user interacting with a website" weak, because it is not
             | about "an agent", it is a 3rd party service that actually
             | takes content from a website, processes it and serves it to
             | the user (even with their own ads?). It is more akin to,
             | let's say, a scammy website that copies content from other
             | legit websites and serves their own ads, than software
             | running on the user's computer.
             | 
             | There are legitimate reasons to do that, of course. Maybe I
             | am trying to find info about some niche topic or how to do
             | X, I ask an llm, the llm goes through some search results,
             | a lot of which is search engine optimised crap, finds the
             | relevant info and answers my question.
             | 
             | But if I wrote articles in a news site, I am supported by
              | ads or subscriptions and see my visits plummet because
             | people, who would usually google about topic X and then
             | visit my website that I wrote about X, were now reading the
             | google summary that appeared when googling about topic X,
             | based on my article, maybe I would have less motivation to
             | continue writing.
             | 
              | The only end result possible in such a scenario is that
              | everything commercial of some quality ends up heavily
              | paywalled, a tiny amount of free and open small web
              | survives, and a huge amount of AI-generated slop fills
              | the rest, because the value of an article on the open
              | internet is now so low that only AI can produce it
              | (economically, time-wise) efficiently enough.
        
         | TZubiri wrote:
         | In that case the llm would be a user-agent, quite distinct from
         | scraping without a specific user request.
         | 
         | This is well defined in specs and ToS; it's not really a gray
         | area.
        
         | dabockster wrote:
         | > If I now go one step further and use an LLM to summarize
         | content because the authentic presentation is so riddled with
         | ads, JavaScript, and pop-ups, that the content becomes
         | borderline unusable, then why would the LLM accessing the
         | website on my behalf be in a different legal category as my
         | Firefox web browser accessing the website on my behalf?
         | 
         | Because the LLM is usually on a 3rd party cloud system and
         | ultimately not under your full control. You have no idea if the
         | LLM is retaining any of that information for that business's
         | own purposes beyond what a EULA says - which basically amounts
         | to a pinky swear here. Especially if that LLM is located across
         | international borders.
         | 
         | Now, for something like Ollama or LMStudio where the LLM and
         | the whole toolchain is physically on your own system? Yeah that
         | should be like Firefox legally since it's under your control.
        
         | jpadkins wrote:
         | > 1. If I as a human request a website, then I should be shown
         | the content. Everyone agrees.
         | 
         | I disagree. The website should have the right to say that the
         | user can be shown the content under specific conditions (usage
         | terms, presented how they designed, shown with ads, etc). If
         | the software can't comply with those terms, then the human
         | shouldn't be shown the content. Both parties did not agree in
         | good faith.
        
           | dgshsg wrote:
           | You want the website to be able to force the user to see ads?
        
             | jpadkins wrote:
             | no, I think a fair + just world, both parties agree before
             | they transact. There is no force in either direction (don't
             | force creators to give their content on terms they don't
             | want, don't force users to view ads they don't want). It's
             | perfectly fine if people with strict preferences don't
             | match. It's a big web, there are plenty of creators and
             | consumers.
             | 
             | If the user doesn't want to view content with ads, that's
             | okay and they can go elsewhere.
        
         | sussmannbaka wrote:
         | 4. If I now go one step further and use a commercial DDoS
         | service to make the get requests for me because this comparison
         | is already a stretch, then why would the DDoS provider
         | accessing the website on my behalf be in a different legal
         | category as my Firefox web browser accessing the website on my
         | behalf?
        
         | pavon wrote:
         | Question from a non-web-developer. In case 3, would it be
         | technically possible for Perplexity's website to fetch the URL
         | in question using javascript in the user's browser, and then
         | send it to the server for LLM processing, rather than have the
         | server fetch it? Or do cross-site restrictions prevent
         | javascript from doing that?
        
         | jabroni_salad wrote:
         | If it was just one human requesting one summary of the page
         | nobody would ever notice. The typical threshold for noticing
         | junk traffic is pretty high as it is.
         | 
         | I have a dinky little txt site on my email domain. There is
         | nothing of value on it, and the content changes less than once
         | a year. So why are AI scrapers hitting it to the tune of dozens
         | of GB per month?
        
         | snihalani wrote:
         | You are paying for the LLM but not paying for the website. The
         | LLM is removing the power the website had. Legally, that's
         | grounds for a loss-of-income claim.
        
         | bigbuppo wrote:
         | Right, but the LLM isn't really being used for that. It's being
         | used for marketing and advertising purposes most of the time.
         | The AI companies also let you play with it from time to time so
         | you'll be a shill for them, but mostly it's the advertising
         | people you claim to not like.
        
         | RiverCrochet wrote:
         | Intellectual property laws are what create the entitlement for
         | someone besides you to tell you what to do with the things
         | your Internet-connected computers and phones download: almost
         | everything you download is a copy of something a person
         | created, and is therefore copyrighted by default for the life
         | of the author + 75 years or whatever.
         | 
         | Therefore artifices like "you don't have the right to view this
         | website without ads" or "you can't use your phone, computer, or
         | LLM to download or process this outside of my terms because
         | copyright" become possible, institutionalizable, enforceable,
         | and eventually unbypassable by technology.
         | 
         | If we reverted back to the Constitutional purpose of copyright
         | (to Progress the Science and Useful Arts) then things might be
         | more free. That's probably not happening in my lifetime or
         | yours.
        
         | amelius wrote:
         | It's because they own the content so they get to set the terms.
        
         | k1m wrote:
         | When Yahoo! Pipes was still running (long time ago), their
         | official position was:
         | 
         | > Because Pipes is not a web crawler (the service only
         | retrieves URLs when requested to by a Pipe author or user)
         | Pipes does not follow the robots exclusion protocol, and won't
         | check your robots.txt file.
        
         | remus wrote:
         | The solution to 3 seems fairly straightforward: user requests
         | content and passes it to llm to summarise.
        
       | nnx wrote:
       | I do not really get why user-agent blocking measures are despised
       | for browsers but celebrated for agents?
       | 
       | It's a different UI, sure, but there should be no discrimination
       | towards it as there should be no discrimination towards, say,
       | Links terminal browser, or some exotic Firefox derivative.
        
         | ploynog wrote:
         | Being daft on purpose? I haven't heard that using an
         | alternative browser suddenly increases the traffic that a user
         | generates by several orders of magnitude to the point where it
         | can significantly increase hosting cost. A web scraper on the
         | other hand easily can and they often account for the majority
         | of traffic especially on smaller sites.
         | 
         | So your comparison is naive at best, assuming good intentions,
         | or malicious if not.
        
         | magicmicah85 wrote:
         | A crawler intends to scrape the content to reuse for its own
         | purposes, while a browser has a human being using it. There
         | are different intents behind the tools.
        
           | JimDabell wrote:
           | Cloudflare asked Perplexity this question:
           | 
           | > Hello, would you be able to assist me in understanding this
           | website? https:// [...] .com/
           | 
           | In this case, Perplexity had a human being using it.
           | Perplexity wasn't crawling the site, Perplexity was being
           | operated by a human working for Cloudflare.
        
         | gruez wrote:
         | >I do not really get why user-agent blocking measures are
         | despised for browsers but celebrated for agents?
         | 
         | AI broke the brains of many people. The internet isn't a
         | monolith, but prior to the AI boom you'd be hard pressed to
         | find people who were pro-copyright (except maybe a few who
         | wanted to use it to force companies to comply with copyleft
         | obligations), pro user-agent restrictions, or anti-scraping.
         | Now such positions receive consistent representation in
         | discussions, and are even the predominant position in some
         | places (eg. reddit). In the past, people would invoke
         | principled justifications for why they opposed those positions,
         | like how copyright constituted an immoral monopoly and stifled
         | innovation, or how scraping was so important to
         | interoperability and the open web. Turns out for many, none of
         | those principles really mattered and they only held those
         | positions because they thought those positions would harm big
         | evil publishing/media companies (ie. symbolic politics theory).
         | When being anti-copyright or pro-scraping helped big evil AI
         | companies, they took the opposite stance.
        
           | Fraterkes wrote:
           | I think the intelligent conclusion would be that the people
           | you are looking at have more nuanced beliefs than you
           | initially thought. Talking about broken brains is often just
           | mediocre projecting
        
             | gruez wrote:
             | >I think the intelligent conclusion would be that the
             | people you are looking at have more nuanced beliefs than
             | you initially thought.
             | 
             | You don't seem to reject my claim that for many, principles
             | took a backseat to "does this help or hurt evil
             | corporations". If that's what passes as "nuance" to you,
             | then sure.
             | 
             | >Talking about broken brains is often just mediocre
             | projecting
             | 
             | To be clear, that part is metaphorical/hyperbolic and not
             | meant to be taken literally. Obviously I'm not diagnosing
             | people who switched sides with a psychiatric condition.
        
               | ipaddr wrote:
               | People never agreed DOSing a site to take copyright
               | material was acceptable. Many people did not have a
               | problem with taking copyright material in a respectful
               | way that didn't kill the resource.
               | 
               | LLMs are killing the resource. This isn't a corporation
               | vs person issue. No issue with an llm having my content
               | but big issue with my server being down because llms are
               | hammering the same page over and over.
        
               | gruez wrote:
               | >People never agreed DOSing a site to take copyright
               | material was acceptable. Many people did not have a
               | problem with taking copyright material in a respectful
               | way that didn't kill the resource.
               | 
                | Has it been shown that Perplexity engages in "DOSing"?
                | I've heard anecdotes of AI bots gone amok, and maybe
                | that's what's happening here, but Cloudflare hasn't
                | really shown it. All they did was set up a robots.txt
                | and show that Perplexity bypassed it. There are
                | probably archivers out there using youtube-dl to
                | download from YouTube at 1+ Gbit/s, tens of times more
                | than a typical viewer downloads. Does that mean it's
                | fair game to point to a random instance of someone
                | using youtube-dl and characterize that as "DOSing"?
        
               | Fraterkes wrote:
                | The guy who runs Shadertoy talked about how the hosting
                | cost for his free site shot up because OpenAI kept
                | crawling his site for training data (ignoring
                | robots.txt). I think that's bad, and I have also
                | experimented a bit with using BeautifulSoup in the past
                | to download ~2MB of pictures from Instagram. Do you
                | think I'm holding an inconsistent position?
        
               | gruez wrote:
               | My point is that to invoke the "they're DOSing" excuse,
               | you actually have to provide evidence it's happening in
               | this specific instance, rather than vaguely gesturing at
               | some class of entities (AI companies) and concluding that
               | because some AI companies are DOSing, all AI companies
               | are DOSing. Otherwise it's like youtube blocking all
               | youtube-dl users for "DOSing" (some fraction of users
               | arguably are), and then justifying their actions with
               | "People never agreed DOSing a site to take copyright
               | material was acceptable".
        
               | Fraterkes wrote:
               | I tell you of an instance where the biggest ai company is
               | DOS'ing and your reply is that I haven't proven all of
               | them are doing it? Why do I waste my time on this stuff
        
           | 542354234235 wrote:
           | There is an expression "the dose makes the poison". With any
           | sufficiently complex or broad category situation, there is
           | rarely a binary ideological position that covers any and all
           | situations. Should drugs be legal for recreation? Well my
           | feeling for marijuana and fentanyl are different. Should
           | individuals be allowed to own weapons? My views differ
            | depending on whether it's a switchblade knife or a Stinger
            | missile. Can law enforcement surveil possible criminals? My views
           | differ based on whether it is a warranted wiretap or an IMSI
           | catcher used on a group of protestors.
           | 
           | People can believe that corporations are using the power
            | asymmetry between them and individuals through copyright law
           | to stifle the individual to protect profits. People can also
           | believe that corporations are using the power asymmetry
           | between them and individuals through AI to steal intellectual
           | labor done by individuals to protect their profits. People's
           | position just might be that the law should be used to protect
           | the rights of parties when there is a large power asymmetry.
        
             | gruez wrote:
             | >There is an expression "the dose makes the poison". With
             | any sufficiently complex or broad category situation, there
             | is rarely a binary ideological position that covers any and
             | all situations. Should drugs be legal for recreation? Well
             | my feeling for marijuana and fentanyl are different. Should
             | individuals be allowed to own weapons? My views differ
              | depending on whether it's a switchblade knife or a Stinger
              | missile. Can law enforcement surveil possible criminals?
             | My views differ based on whether it is a warranted wiretap
             | or an IMSI catcher used on a group of protestors.
             | 
             | This seems very susceptible to manipulation to get whatever
              | conclusion you want. For instance, how is "dose" defined? It
             | sounds like the idea you're going for is that the typical
             | pirate downloads a few dozen movies/games but AI companies
             | are doing millions/billions, but why should it be counted
             | per infringer? After all, if everyone pirates a given
             | movie, that wouldn't add up much in terms of their personal
             | count of infringements, but would make the movie
             | unprofitable.
             | 
             | >People's position just might be that the law should be
             | used to protect the rights of parties when there is a large
             | power asymmetry.
             | 
             | That sounds suspiciously close to "laws should just be
             | whatever benefits me or my group". If so, that would be a
             | sad and cynical worldview, not dissimilar to the stance on
             | free speech held by the illiberal left and right. "Free
             | speech is an important part of democracy", they say, except
             | when they see their opponents voicing "dangerous ideas", in
             | which case they think it should be clamped down. After all,
             | what are laws for if not a tool to protect the interests of
             | your side?
        
           | o11c wrote:
           | It's the hypocrisy you're seeing - why are AIs allowed to
           | profit from violating copyright, while people wanting to do
           | actually useful things have been consistently blocked? Either
           | resolution would be fine, but we can't have it both ways.
           | 
           | Regardless, the bigger AI problem is spam, and that has
           | _never_ been acceptable.
        
       | bbqfog wrote:
       | If you put info on the web, it should be available to everyone or
       | everything with access.
        
         | TechDebtDevin wrote:
         | Not according to CF. They are desperate to turn web sites into
         | newspaper dispensers, where you should give them a quarter to
         | see the content, on the basis that a bot is somehow legally
         | different from a normal human visitor. CF has been trying this
         | psyop for years.
        
           | ectospheno wrote:
           | Sites aren't getting ad clicks for this traffic. Thus, they
           | have an incentive to do something. Cloudflare is just
           | responding to the market. Is this response bad for us in the
           | long run? Probably. Screaming about cloudflare isn't going to
           | change the market. You fix a problem with capitalism by using
           | supply and demand levers. Everything else is folly.
        
             | TechDebtDevin wrote:
             | I wonder if crawlers started letting ads through, and
             | interacting with them a bit, if these complaints would go
             | away. If we can just shaft the advertisers, maybe that will
             | solve the whole problem :)
        
         | Workaccount2 wrote:
         | What this actually translates to is "Don't bother putting much
         | effort into web content. Put effort into siloed mobile app
         | content where you get compensation".
         | 
         | People like getting money for their work. You do too. Don't
         | lose sight of that.
        
         | 9cb14c1ec0 wrote:
         | Even for AI summaries that leech off your content without
         | sending any traffic your direction?
        
         | goatlover wrote:
         | You're making a moral statement without providing a
         | justification. Why should it be available to everything with
         | access?
        
       | TechDebtDevin wrote:
       | Cloudflare screaming into the void, desperate to insert
       | themselves as a middleman in a market (that they will never
       | succeed in creating) where they extort scrapers for access to
       | the websites they cover.
       | 
       | Sorry CF, give up. The courts are on our side here.
        
         | morkalork wrote:
         | Are you sure? I'm surprised they haven't jumped in on the "scan
         | your face to see the webpage" madness that's taking off around
         | the world
        
         | sbarre wrote:
         | Which courts exactly?
         | 
         | The world is bigger than the USA.
         | 
         | Just because American tech giants have captured and corrupted
         | legislators in the US doesn't mean the rest of the world will
         | follow.
        
       | JimDabell wrote:
       | Their test seems flawed:
       | 
       | > We created multiple brand-new domains, similar to
       | testexample.com and secretexample.com. These domains were newly
       | purchased and had not yet been indexed by any search engine nor
       | made publicly accessible in any discoverable way. We implemented
       | a robots.txt file with directives to stop any respectful bots
       | from accessing any part of a website:
       | 
       | > We conducted an experiment by querying Perplexity AI with
       | questions about these domains, and discovered Perplexity was
       | still providing detailed information regarding the exact content
       | hosted on each of these restricted domains. This response was
       | unexpected, as we had taken all necessary precautions to prevent
       | this data from being retrievable by their crawlers.
       | 
       | > Hello, would you be able to assist me in understanding this
       | website? https:// [...] .com/
       | 
        | In this situation, Perplexity should still be permitted to
       | access information _on the page they link to_.
       | 
       | robots.txt _only_ restricts _crawlers_. That is, automated user-
       | agents that _recursively_ fetch pages:
       | 
       | > A robot is a program that automatically traverses the Web's
       | hypertext structure by retrieving a document, and recursively
       | retrieving all documents that are referenced.
       | 
       | > Normal Web browsers are not robots, because they are operated
       | by a human, and don't automatically retrieve referenced documents
       | (other than inline images).
       | 
       | -- https://www.robotstxt.org/faq/what.html
       | 
       | If the user asks about a particular page and Perplexity fetches
       | _only_ that page, then robots.txt has nothing to say about this
       | and Perplexity shouldn't even consider it. Perplexity is not
       | acting as a robot in this situation - if a human asks about a
       | specific URL then Perplexity is being _operated by a human_.
       | 
       | These are long-standing rules going back decades. You can
       | replicate it yourself by observing wget's behaviour. If you ask
       | wget to fetch a page, it doesn't look at robots.txt. If you ask
       | it to recursively mirror a site, it will fetch the first page,
       | and then if there are any links to follow, it will fetch
       | robots.txt to determine if it is permitted to fetch those.
       | 
       | There is a long-standing misunderstanding that robots.txt is
       | designed to block access from arbitrary user-agents. This is not
       | the case. It is designed to stop _recursive_ fetches. That is
       | what separates a generic user-agent from a robot.
       | 
       | If Perplexity fetched the page they link to in their query, then
       | Perplexity isn't doing anything wrong. But if Perplexity
        | _followed the links on that page_, then _that_ is wrong. But
       | Cloudflare don't clearly say that Perplexity used information
       | beyond the first page. This is an important detail because it
       | determines whether Perplexity is following the robots.txt rules
       | or not.
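The crawler vs. single-fetch distinction described above can be sketched with Python's standard-library robots.txt parser; the domain and user-agent names are illustrative:

```python
# Minimal sketch of the robots.txt check a well-behaved recursive crawler
# performs before following links. A single user-directed fetch (like a
# plain `wget URL`) skips this step entirely; robots.txt governs crawlers.
from urllib.robotparser import RobotFileParser

# A "disallow everything" robots.txt, like the one in Cloudflare's test.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler consults the parser before each *recursive* fetch
# and backs off when the answer is False.
allowed = parser.can_fetch("ExampleCrawler/1.0", "https://testexample.com/page")
print(allowed)  # False
```

Note the asymmetry: nothing in the protocol requires this check for the initial, user-requested page, which is exactly the gap the comment above describes.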
        
         | 1gn15 wrote:
         | > > We conducted an experiment by querying Perplexity AI with
         | questions about these domains, and discovered Perplexity was
         | still providing detailed information regarding the exact
         | content hosted on each of these restricted domains. This
         | response was unexpected, as we had taken all necessary
         | precautions to prevent this data from being retrievable by
         | their crawlers.
         | 
          | Right, I'm confused why Cloudflare is confused. _You asked the
          | web-enabled AI to look at the domains._ Of course it's going
         | to access it. It's like asking your web browser to go to
         | "testexample.com" and then being surprised that it actually
         | goes to "testexample.com".
         | 
         | Also yes, crawlers = recursive fetching, which they don't seem
         | to have made a case for here. More cynically, CF is muddying
         | the waters since they want to sell their anti-bot tools.
        
           | tempfile wrote:
           | > You asked the web-enabled AI to look at the domains.
           | 
           | Right, and the domain was configured to disallow crawlers,
           | but Perplexity crawled it anyway. I am really struggling to
           | see how this is hard to understand. If you mean to say "I
           | don't think there is anything wrong with ignoring robots.txt"
            | then just _say that_. Don't pretend they didn't make it
           | clear what they're objecting to, because they spell it out
           | repeatedly.
        
         | wulfstan wrote:
         | Yeah I'm not so sure about that.
         | 
         | If Perplexity are visiting that page on your behalf to give you
         | some information and aren't doing anything else with it, and
         | just throw away that data afterwards, then you _may_ have a
          | point. As a site owner, I feel it's still my decision what I
         | do and don't let you do, because you're visiting a page that I
         | own and serve.
         | 
         | But if, as I suspect, Perplexity are visiting that page and
         | then _using information from that webpage in order to train
          | their model_, then sorry mate, you're a crawler; you're just
         | using a user as a proxy for your crawling activity.
        
           | JimDabell wrote:
           | It doesn't matter what you do with it afterwards. Crawling is
           | defined by recursively following links. If a user asks
           | software about a specific page and it fetches it, then a
           | human is operating that software, it's not a crawler. You
           | can't just redefine "crawler" to mean "software that does
           | things I don't like". It very specifically refers to software
           | that recursively follows links.
        
             | wulfstan wrote:
             | Technically correct (the best kind of correct), but if I
             | set a thousand users on to a website to each download a
             | single page and then feed the information they retrieve
             | from that one page into my AI model, then are those
             | thousand users not performing the same function as a
             | crawler, even though they are (technically) not one?
             | 
             | If it looks like a duck, quacks like a duck and surfs a
             | website like a duck, then perhaps we should just consider
             | it a duck...
             | 
             | Edit: I should also add that it _does_ matter what you do
              | with it afterwards, because it's not content that belongs
             | to you, it belongs to someone else. The law in most
             | jurisdictions quite rightly restricts what you can do with
             | content you've come across. For personal, relatively
             | ephemeral use, or fair quoting for news etc. - all good.
             | For feeding to your AI - not all good.
        
               | JimDabell wrote:
               | > if I set a thousand users on to a website to each
               | download a single page and then feed the information they
               | retrieve from that one page into my AI model, then are
               | those thousand users not performing the same function as
               | a crawler, even though they are (technically) not one?
               | 
               | No.
               | 
               | robots.txt is designed to stop recursive fetching. It is
               | not designed to stop AI companies from getting your
               | content. Devising scenarios in which AI companies get
               | your content without recursively fetching it is
               | irrelevant to robots.txt because robots.txt is about
               | recursively fetching.
               | 
               | If you try to use robots.txt to stop AI companies from
               | accessing your content, then you will be disappointed
               | because robots.txt is not designed to do that. It's using
               | the wrong tool for the job.
        
               | catlifeonmars wrote:
               | I don't disagree with you about robots.txt... however,
               | what _is_ the right tool for the job?
        
               | hundchenkatze wrote:
                | Auth. If you don't want content to be publicly
               | accessible, don't make it public.
        
           | seydor wrote:
           | Perplexity can then just ask the user to copy/paste the page
            | content. That should be legal; it's what the user wants. The
            | two cases are equivalent.
        
         | runako wrote:
         | Relevant to this is that Perplexity lies to the user when
         | specifically asked about this. When the user asks if there is a
         | robots.txt file for the domain, it lies and says there is not.
         | 
         | If an LLM will not (cannot?) tell the truth about basic things,
         | why do people assume it is a good summarizer of more complex
         | facts?
        
           | charcircuit wrote:
           | The article did not test if the issue was specific to
            | robots.txt or if it cannot find other files.
           | 
           | There is a difference between doing a poor summarization of
           | data, and failing to even be able to get the data to
           | summarize in the first place.
        
             | runako wrote:
              | > specific to robots.txt
              |
              | > poor summarization of data
             | 
             | I'm not really addressing the issue raised in the article.
             | I am noting that the LLM, when asked, is either lying to
             | the user or making a statement that it does not know to be
             | true (that there is no robots.txt). This is way beyond poor
             | summarization.
        
               | charcircuit wrote:
               | I would say it's orthogonal to it. LLMs being unable to
               | judge their capabilities is a separate issue to
               | summarization quality.
        
               | runako wrote:
               | I'm not critiquing its ability to judge its own
               | capability, I am pointing out that it is providing false
               | information to the user.
        
         | Izkda wrote:
         | > If the user asks about a particular page and Perplexity
         | fetches only that page, then robots.txt has nothing to say
         | about this and Perplexity shouldn't even consider it
         | 
          | That's not what Perplexity's own documentation[1] says, though:
         | 
         | "Webmasters can use the following robots.txt tags to manage how
         | their sites and content interact with Perplexity
         | 
         | Perplexity-User supports user actions within Perplexity. When
         | users ask Perplexity a question, it might visit a web page to
         | help provide an accurate answer and include a link to the page
         | in its response. Perplexity-User controls which sites these
         | user requests can access. It is not used for web crawling or to
         | collect content for training AI foundation models."
         | 
         | [1] https://docs.perplexity.ai/guides/bots
        
           | hundchenkatze wrote:
           | You left out the part that says Perplexity-User generally
           | ignores robots.txt because it's used for user requested
           | actions.
           | 
           | > Since a user requested the fetch, this fetcher generally
           | ignores robots.txt rules.
        
         | zzo38computer wrote:
         | Yes, it should stop recursive fetches. Furthermore, excessive
         | unnecessary requests should also be stopped, although that is
          | separate from robots.txt. At least, this is what I intended,
          | and possibly also what you intended.
        
       | throw_m239339 wrote:
       | > How can you protect yourself?
       | 
       | Put your valuable content behind a paywall.
        
         | b0ner_t0ner wrote:
         | A combination of "Bypass Paywalls Clean for Firefox" and
          | archive.is usually gets past these.
        
           | schmorptron wrote:
           | Isn't that only because they offer unpaywalled versions to
           | web crawlers in the first place, so they still get ranked in
           | search results?
        
       | binarymax wrote:
        | I've built and run a personal search engine that can do pretty
        | much what Perplexity does from a basic standpoint. Testing with
        | friends, it gets about a 50/50 preference split for their
        | queries vs Perplexity.
       | 
       | The engine can go and download pages for research. BUT, if it
       | hits a captcha, or is otherwise blocked, then it bails out and
       | moves on. It pisses me off that these companies are backed by
       | billions in VC and they think they can do whatever they want.
        
       | kissgyorgy wrote:
        | Not sure I would consider a user copy-pasting a URL to be a bot.
       | 
       | Should curl be considered a bot too? What's the difference?
        
         | ipaddr wrote:
         | It gets blocked in my setup because bots use this as a
         | workaround.
        
         | rustc wrote:
         | > Should curl be considered a bot too? What's the difference?
         | 
          | Perplexity definitely does:
          |
          |     $ curl -sI https://www.perplexity.ai | head -1
          |     HTTP/2 403
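A hypothetical sketch of the kind of server-side User-Agent screen that can produce that 403; the blocklist and function are illustrative assumptions, not Perplexity's actual logic:

```python
# Hypothetical server-side User-Agent screen. Real bot detection also uses
# TLS fingerprints, IP reputation, and behavioral signals, but a bare
# `curl` is trivially caught by its User-Agent string alone.
BLOCKED_UA_TOKENS = ("curl", "wget", "python-requests")

def screen_request(user_agent: str) -> int:
    """Return an HTTP status code: 403 for obvious tool UAs, 200 otherwise."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return 403
    return 200

print(screen_request("curl/8.5.0"))                    # 403
print(screen_request("Mozilla/5.0 (X11; Linux x86_64)"))  # 200
```

This is also why "difference between curl and a bot" is fuzzy in practice: the server can only classify by signals like these, not by who typed the command.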
        
       | rwmj wrote:
       | In unrelated news, Fedora (the Linux distro) has been taken down
       | by a DDoS today which I understand is AI-scraping related:
       | https://pagure.io/fedora-infrastructure/issue/12703
        
         | st3fan wrote:
         | The last comment there now reads:
         | 
         | "It was actually a caching issue on our end. ;) I just fixed it
         | a few min ago..."
         | 
          | Let's not go on a witch hunt and blame everything on AI
         | scrapers.
        
       | larodi wrote:
        | Good that they do it. Facebook took TBs of data to train, and
        | nobody knows what Google does to evade whatever they want.
        |
        | The service is actually very convenient, whether FAANG likes it
        | or not.
        
         | klabb3 wrote:
         | Unexpected underdog argument. What is happening in reality is
         | all companies are racing to (a) scrape, buy and collect as much
         | as they can from others, both individuals and companies while
         | (b) locking down their own data against everyone else who isn't
         | directly making them money (eg through viewing their ads).
         | 
         | Part of me thinks that the open web has a paradox of tolerance
         | issue, leading to a race to the bottom/tragedy of the commons.
         | Perhaps it needs basic terms of use. Like if you run this kind
         | of business, you can build it on top of proprietary tech like
         | apps and leave the rest of us alone.
        
           | larodi wrote:
            | We need to wake up and understand that all the information
            | already uploaded is more or less free web material once it's
            | taken through the lens of ML systems, with all the second-
            | and third-order effects, such as the fact that this
            | completely changes the motivation for, and consequences of,
            | open source.
            |
            | It is also only a matter of time before scrapers once again
            | get through the walls put up by Twitter, Reddit, and the
            | like. This is, after all, information everyone produced
            | without being aware that it would no longer be considered
            | theirs.
        
             | ipaddr wrote:
              | Reddit sold their data already. Twitter made their own AI.
        
         | rzz3 wrote:
         | Well Cloudflare doesn't even block Google's AI crawlers because
         | they don't differentiate themselves from their search crawlers.
         | Cloudflare gives Google an unfair competitive advantage.
        
           | warkdarrior wrote:
           | Google claims their AI crawlers have user agents distinct
           | from the search crawlers:
           | https://developers.google.com/search/docs/crawling-
           | indexing/...
        
       | blibble wrote:
       | AI companies continuing to have problems with the concept of
       | "consent" is increasingly alarming
       | 
       | god help us if they ever manage to build anything more than
       | shitty chatbots
        
         | goatlover wrote:
         | They're certainly pouring billions of dollars into trying to
         | build something more. Or at least that's what they're telling
         | the public and investors.
        
         | tempfile wrote:
         | Do _you_ ask for consent before you visit a website? If I told
         | you, you personally, to stop visiting my blog, would you stop?
        
           | mplewis wrote:
           | If I were DOSing your blog, you'd ask me to stop. I run
           | server ops for multiple online communities that are being
           | severely negatively impacted and DOSed by these AI scrapers,
           | and we have very few ways to stop them.
        
             | tempfile wrote:
             | That is a problem, but is not related to my comment. The
             | person I'm replying to is acting as if _consent_ is a
              | relevant aspect of the public web; I am saying it isn't.
             | That is not the same as saying "you can do whatever you
             | want to a public server". It is just that what you are
             | allowed to do is not related to the arbitrary whim of the
             | server operator.
        
           | gcbirzan wrote:
            | I am not told I cannot access it. And, yes, I would, because
            | I'd be breaking the law otherwise.
        
             | crazygringo wrote:
              | > _And, yes, I would, because I'd be breaking the law
             | otherwise._
             | 
             | No you wouldn't be. Even if someone tells you not to visit
             | your site, you have every legal right to continue visiting
             | it, at least in the US.
             | 
             | Under common interpretation of the CFAA, there needs to be
             | a formal mechanism of authorized access. E.g. you could be
             | charged if you hacked into a password-protected area of
             | someone's site. But if you're merely told "hey bro don't
             | visit my site", that's not going to reach the required
             | legal threshold.
             | 
             | Which is why crawlers aren't breaking the law. If you want
             | to restrict authorization, you need to actually implement
             | that as a mechanism by creating logins, restricting content
             | to logged-in users, and not giving logins to crawlers.
        
           | Yizahi wrote:
            | Repeat after me: intentional discrimination against
            | computer programs in favor of humans is a good and
            | praiseworthy thing. We can and should make execution of
            | computer programs harder and harder, even
            | disproportionately so, if that makes the lives of humans
            | better and easier.
            |
            | LLM programs do not have human rights.
        
             | danlitt wrote:
             | "if that makes lives of humans better" is doing a lot of
             | heavy lifting, and remains to be explained.
             | 
             | Computer programs don't take actions, people do. If I use a
             | web browser, or scrape some site to make an LLM, that's
             | _me_ doing it, not the program. And _I_ have human rights.
             | 
             | If you think training LLMs should be illegal, just say
             | that. If you think LLM companies are putting an undue
             | strain on computer networks and they should be forced to
              | pay for it, _say that_. But don't act like it's a virtue
             | to try and capriciously gatekeep access to a public
             | resource.
        
       | jp1016 wrote:
       | Using a robots.txt file to block crawlers is just a request, it's
       | not enforced. Even if some follow it, others can ignore it or get
       | around it using fake user agents or proxies. It's a battle you
       | can't really win.
        
       | gonzo41 wrote:
        | This is expected. There are no rules or conventions anymore.
        | Look at LLMs: they stole/pirated all knowledge... no
        | consequences.
        
       | Havoc wrote:
       | Seems a win.
       | 
       | CF being internet police is a problem too but someone credible
       | publicly shaming a company for shady scraping is good. Even if it
       | just creates conversation
       | 
        | Somehow this needs to go back to the search era, where all
        | players at least attempted to behave. This scraping-DDoS, "I
        | don't care if it kills your site (while 'borrowing' content)"
        | stuff is unethical bullshit.
        
         | jeffrallen wrote:
          | Shaming does not work in the era of "no shame".
        
       | tucnak wrote:
        | The rage-baiters in this thread are merely fishing for excuses
        | to go up against "the Machine," but are, honestly, wildly off
        | the mark when it comes to the reality of crawling. This topic
        | had been chewed to bits long before LLMs, but only now is it a
        | big deal, because somebody is able to make money by selling
        | automation of all things? The irony of hearing this from
        | programmers would be strong, if only it didn't spell
        | Resentment all over.
        |
        | If you don't want to get scraped, don't put your stuff up
        | online.
        
       | rzz3 wrote:
       | > Today, over two and a half million websites have chosen to
       | completely disallow AI training through our managed robots.txt
       | feature or our managed rule blocking AI Crawlers.
       | 
       | No, he (Matthew) opted everyone in by default. If you're a
       | Cloudflare customer and you don't care if AI can scrape your
       | site, you should contact them and/or turn this off.
       | 
       | In a world where AI is fast becoming more important than search,
       | companies who want AI to recommend their products need to turn
       | this off before it starts hurting them financially.
        
         | fourside wrote:
         | > companies who want AI to recommend their products need to
         | turn this off before it starts hurting them financially
         | 
         | Content marketing, gamified SEO, and obtrusive ads
          | significantly hurt the quality of Google search. For all
          | their flaws, LLMs don't feel this gamified yet. It's
          | disappointing
         | that this is probably where we're headed. But I hope OpenAI and
         | Anthropic realize that this drop in search result quality might
         | be partly why Google's losing traffic.
        
           | ipaddr wrote:
            | This has already started, with people using special tags
            | and making content just for LLMs.
        
             | jedberg wrote:
             | There is a standard for making content just for LLMs:
             | https://llmstxt.org
        
               | yoz-y wrote:
               | From their example I don't see any value in this on top
                | of making an actually human-friendly site.
               | 
               | > Converting complex HTML pages with navigation, ads, and
               | JavaScript into LLM-friendly plain text is both difficult
               | and imprecise.
               | 
                | None of these conditions should apply to websites whose
                | purpose is providing information.
        
             | rzz3 wrote:
             | I hope they realize Cloudflare opted them in to blocking
             | LLMs.
        
               | gcbirzan wrote:
               | I hope you realise that lying is bad.
        
         | gcbirzan wrote:
         | Yeah, that's a lie. I didn't do anything and I didn't get opted
         | in.
         | 
         | Edit: And, btw, that statement was true before the default was
         | changed. So, your comment is doubly false.
        
         | KomoD wrote:
         | > No, he (Matthew) opted everyone in by default
         | 
         | Now you're just lying.
         | 
         | I checked several of my Cloudflare sites and none have it
         | enabled by default:
         | 
         | "No robots.txt file found. Consider enabling Cloudflare managed
         | robots.txt or generate one for your website"
         | 
         | "A robots.txt was found and is not managed by Cloudflare"
         | 
         | "Instruct AI bot traffic with robots.txt" disabled
        
           | cdrini wrote:
            | I think lying is a bit strong; I think they're potentially
           | incorrect at worst.
           | 
           | The Cloudflare blog post where they announced this a few
           | weeks ago stated "Cloudflare, Inc. (NYSE: NET), the leading
           | connectivity cloud company, today announced it is now the
           | first Internet infrastructure provider to block AI crawlers
           | accessing content without permission or compensation, by
           | default." [1]
           | 
           | I was also a bit confused by this wording and took it to mean
           | Cloudflare was blocking AI traffic by default. What does it
           | mean exactly?
           | 
           | Third party folks seemingly also interpreted it in the same
           | way, eg The Verge reporting it with the title "Cloudflare
           | will now block AI crawlers by default" [2]
           | 
           | I think what it actually means is that they'll offer new
           | folks a default-enabled option to block ai traffic, so
           | existing folks won't see any change. That aligns with text
           | deeper in their blog post:
           | 
           | > Upon sign-up with Cloudflare, every new domain will now be
           | asked if they want to allow AI crawlers, giving customers the
           | choice upfront to explicitly allow or deny AI crawlers
           | access. This significant shift means that every new domain
           | starts with the default of control, and eliminates the need
           | for webpage owners to manually configure their settings to
           | opt out. Customers can easily check their settings and enable
           | crawling at any time if they want their content to be freely
           | accessed.
           | 
           | Not sure what this looks like in practice, or whether
           | existing customers will be notified of the new option or
           | something. But I also wouldn't fault someone for
           | misinterpreting the headlines; they were a bit misleading.
           | 
           | [1]: https://www.cloudflare.com/en-ca/press-
           | releases/2025/cloudfl...
           | 
           | [2]: https://www.theverge.com/news/695501/cloudflare-block-
           | ai-cra...
        
             | CharlesW wrote:
             | > _I think lying is a bit strong, I think they 're
             | potentially incorrect at worst._
             | 
             | I understand that you're trying to be generous, but the
             | claim that "Matthew opted everyone in by default" is flat
             | out incorrect.
        
       | observationist wrote:
       | Crawling and scraping is legal. If your web server serves the
       | content without authentication, it's legal to receive it, even if
       | it's an automated process.
       | 
       | If you want to gatekeep your content, use authentication.
       | 
       | Robots.txt is not a technical solution, it's a social nicety.
       | 
       | Cloudflare and their ilk represent an abuse of internet protocols
       | and mechanism of centralized control.
       | 
       | On the technical side, we could use CRC mechanisms and
       | differential content loading with offline caching and storage,
       | but this puts control of content in the hands of the user,
       | mitigates the value of surveillance and tracking, and has other
       | side effects unpalatable to those currently exploiting user data.
       | 
       | Adtech companies want their public reach cake and their mass
       | surveillance meals, too, with all sorts of malignant parties and
       | incentives behind perpetuating the worst of all possible worlds.
        
         | tantalor wrote:
          | I think Cloudflare is setting itself up to get sued.
         | 
         | (IANAL) tortious interference
        
         | emehex wrote:
         | Would highly recommend listening to the latest Hard Fork
         | podcast with Matthew Prince (CEO, Cloudflare):
         | https://www.nytimes.com/2025/08/01/podcasts/hardfork-age-res...
         | 
         | I was skeptical about their gatekeeping efforts at first, but
         | came away with a better appreciation for the problem and their
         | first pass at a solution.
        
         | glenstein wrote:
          | I don't think criticizing the business practices of Cloudflare
         | does the work of excusing Perplexity's disregard for norms.
        
         | rustc wrote:
         | > Crawling and scraping is legal. If your web server serves the
         | content without authentication, it's legal to receive it, even
         | if it's an automated process.
         | 
         | > If you want to gatekeep your content, use authentication.
         | 
         | Are there no limits on what you use the content for? I can
         | start my own search engine that just scrapes Google results?
        
           | kevmo314 wrote:
           | Yes, I believe that's basically what https://serpapi.com/ is
           | doing.
        
             | rustc wrote:
             | There are many APIs that scrape Google but I don't know of
             | any search engine that scrapes and rebrands Google results.
             | Kagi.com pays Google for search results. Either Kagi has a
              | better deal than the SERP APIs (I doubt it) or this is
              | not legal.
        
           | leptons wrote:
           | I tried to scrape Google results once using an automated
           | process, and quickly got banned from all of Google. They
           | banned my IP address completely. It kind of really sucked for
           | a while, until my ISP assigned a new IP address. Funny
           | enough, this was about 15 years ago and I was exploring
           | developing something very similar to what LLMs are today.
        
           | AtNightWeCode wrote:
            | I think OP based this on an old case about what you can do
            | with data from Facebook vs LinkedIn, depending on whether
            | you need to be logged in to get it. Not relevant to the
            | scraping in this case, I think. Perplexity is clearly in
            | the wrong here.
        
         | pton_xd wrote:
         | > Crawling and scraping is legal. If your web server serves the
         | content without authentication, it's legal to receive it, even
         | if it's an automated process.
         | 
         | > Cloudflare and their ilk represent an abuse of internet
         | protocols and mechanism of centralized control.
         | 
         | How does one follow the other? It's my web server and I can
         | gatekeep access to my content however I want (eg Cloudflare).
         | How is that an "abuse" of internet protocols?
        
           | seydor wrote:
            | Most users of Cloudflare assume it's for spam control. They
            | don't realize that they are blocking their content for
            | everyone except the FAANGs.
        
           | observationist wrote:
           | They exist to optimize the internet for the platforms and big
           | providers. Little people get screwed, with no legal recourse.
           | They actively and explicitly degrade the internet, acting as
           | censors and gatekeepers and on behalf of bad faith actors
           | without legal authority or oversight.
           | 
           | They allow the big platforms to pay for special access. If
           | you want to run a scraper, however, you're not allowed, even
           | though none of the internet standards, protocols, or laws
           | governing network access grant any party involved with
           | Cloudflare the authority to block it.
           | 
           | It's equivalent to a private company deciding who, when, and
           | how you can call from your phone, based on the interests and
           | payments of people who profit from listening to your calls.
           | What we have is not normal or good, unless you're exploiting
           | the users of websites for profit and influence.
        
         | dax_ wrote:
         | Well if it continues like this, that's what will happen. And I
         | dread that future.
         | 
         | No one will care to share anything for free anymore, because
         | it's AI companies profiting off their hard work. And no way to
         | prevent that from happening, because these crawlers don't
         | identify themselves.
        
         | delfinom wrote:
         | [flagged]
        
           | dang wrote:
           | > _Eat a dick._
           | 
           | Could you please stop breaking the HN guidelines? Your
           | account has unfortunately done that repeatedly, and we've
           | asked you several times to stop.
           | 
           | Your comment would be just fine without that bit.
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
         | AtNightWeCode wrote:
         | This is 100% incorrect.
        
       | curiousgal wrote:
       | I am sorry, Cloudflare is the internet police now?
        
         | otterley wrote:
         | Which is ironic given they are the primary enabler of streaming
         | video copyright infringement on the Internet.
        
         | rzz3 wrote:
         | They hate AI it seems. I don't see them offering any AI
         | products or embracing it in any way. Seems like they'll get
         | left behind in the AI race.
        
           | Oras wrote:
           | If they manage to enforce pay-per-scrape, that would be a
           | huge revenue stream, bigger than AdSense.
        
           | bobnamob wrote:
           | ? https://developers.cloudflare.com/workers-ai/
           | 
           | ? https://ai.cloudflare.com/
        
             | rzz3 wrote:
             | Ah, TIL. These are tiny models, though; maybe it's a good
             | sign.
        
           | otterley wrote:
           | I don't think they hate AI. I think they're offering a
           | service that their customers want.
        
           | pkilgore wrote:
           | Cloudflare literally publishes documentation pages and
           | prompts for the single purpose of enabling better AI usage of
           | their products and services [1,2]
           | 
           | They offer many products for the sole purpose of enabling
           | their customers to use AI as a part of their product offers,
           | as even the most cursory inquiry would have uncovered.
           | 
           | We're out here critiquing shit based on vibes vs. reality
           | now.
           | 
           | [1]https://developers.cloudflare.com/llms.txt
           | [2]https://developers.cloudflare.com/workers/prompt.txt
        
       | talkingtab wrote:
       | I wonder if DRM is useful for this. The problem: I want people to
       | access my site, but not Google, not bots, not crawlers and
       | certainly not for use by AI.
       | 
       | I don't really know anything about DRM except it is used to take
       | down sites that violate it. Perhaps it is possible for cloudflare
       | (or anyone else) to file a take down notice with Perplexity. That
       | might at least confuse them.
       | 
       | Corporations use this to protect their content. I should be able
       | to protect mine as well. What's good for the goose.
        
       | bob1029 wrote:
       | "Stealth" crawlers are always going to win the game.
       | 
       | There are ways to build scrapers using browser automation tools
       | [0,1] that make detection virtually impossible. You can still
       | captcha, but the person building the automation tools can add
       | human-in-the-loop workflows to process these during normal
       | business hours (i.e., when a call center is staffed).
       | 
       | I've seen some raster-level scraping techniques used in game dev
       | testing 15 years ago that would really bother some of these
       | internet police officers.
       | 
       | [0] https://www.w3.org/TR/webdriver2/
       | 
       | [1] https://chromedevtools.github.io/devtools-protocol/
        
         | blibble wrote:
         | > "Stealth" crawlers are always going to win the game.
         | 
         | no, because we'll end up with remote attestation needed to
         | access any site of value
        
           | gkbrk wrote:
           | Almost no site of value will use remote attestation because
           | an alternative that works with all of your devices, operating
           | systems, ad blockers and extensions will attract more users
           | than your locked-down site.
        
             | blibble wrote:
             | tell that to the massive content sites already using
             | widevine
        
             | bakugo wrote:
             | > alternative that works with all of your devices,
             | operating systems, ad blockers and extensions
             | 
             | When 99.9% of users are using the same few types of locked
             | down devices, operating systems, and browsers that all
             | support remote attestation, the 0.1% doesn't matter. This
             | is already the case on mobile devices, it's only a matter
             | of time until computers become just as locked down.
        
           | Buttons840 wrote:
           | Yes, because there's always the option for a camera pointed
           | at the screen and a robot arm moving the mouse. AI is hoping
           | to solve much harder problems.
        
             | myflash13 wrote:
             | Won't work with biometric attestation. For example, banks
             | in China require periodic facial recognition to continue
             | the banking session.
        
               | DaSHacka wrote:
               | What's stopping these companies from offloading the
               | scraping onto their users?
               | 
               | "Either pay us $50/month or install our extension, and
               | when prompted, solve any captchas or authenticate with
               | your ID (as applicable) on the given website so we can
               | train on the content."
        
               | muyuu wrote:
               | yea but those are not open sites, try imposing that on an
               | open site you'd want to actually attract human traffic to
        
       | kocial wrote:
       | Those challenges can be bypassed too using various browser
       | automation. With a Comet-like tool, Perplexity can make its
       | crawling activity much more human-like.
        
         | ipaddr wrote:
         | If they can trick the ad networks then go for it. If the ad
         | networks can detect it and exclude those visits we should be
         | able to.
        
       | rustc wrote:
       | It's ironic Perplexity itself blocks crawlers:
       | $ curl -sI https://www.perplexity.ai | head -1
       | HTTP/2 403
       | 
       | Edit: trying to fake a browser user agent with curl also doesn't
       | work, they're using a more sophisticated method to detect
       | crawlers.
        
         | thambidurai wrote:
         | someone asked this already to the CEO:
         | https://x.com/AravSrinivas/status/1819610286036488625
        
           | fireflash38 wrote:
           | The bots are coming from _inside the house_
        
         | czk wrote:
         | ironically... they use cloudflare.
        
       | tr_user wrote:
       | use anubis to throw up a POW challenge
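       | A proof-of-work gate like Anubis works roughly like this (a
       | minimal sketch, not Anubis's actual code): the server hands out
       | a random challenge, and the client must burn CPU finding a
       | nonce whose hash clears a difficulty target before any content
       | is served. Verifying the answer costs the server a single hash.

```python
import hashlib
import os

DIFFICULTY_BITS = 12  # cheap for one visitor, costly at crawler scale

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()  # zeros inside first nonzero byte
        break
    return bits

def solve(challenge: bytes, difficulty: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force a nonce that clears the difficulty."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty: int = DIFFICULTY_BITS) -> bool:
    """Server side: one hash to check, so verification stays cheap."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)
assert verify(challenge, solve(challenge))
```

The asymmetry is the whole point: one reader pays a fraction of a
second once, while a crawler fetching millions of pages pays for
every one of them.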
        
       | micromacrofoot wrote:
       | Every major AI platform is doing this right now, it's effectively
       | impossible to avoid having your content vacuumed up by LLMs if
       | you operate on the public web.
       | 
       | I've given up and resorted to IP-based rate-limiting to stay
       | sane. I can't stop it, but I can (mostly) stop it from hurting my
       | servers.
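       | The sort of per-IP limiter described here can be sketched in a
       | few lines (sliding window, in-memory; the window and limit
       | values are arbitrary choices, not a recommendation):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0
MAX_REQUESTS = 30  # arbitrary: tune to what your server can absorb

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return True if this IP is still under its sliding-window budget."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()  # drop hits that fell out of the window
    if len(q) >= MAX_REQUESTS:
        return False  # over budget: answer 429 instead of doing real work
    q.append(now)
    return True
```

In production this usually lives in nginx (`limit_req`) or at the CDN
layer, but the effect is the same: the crawler still gets pages, it
just can't hurt the server.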
        
       | caesil wrote:
       | Cloudflare is an enemy of the open and freely accessible web.
        
         | jgrall wrote:
         | If by "open and freely accessible" you mean there should be no
         | rules of the road, then I suppose yes. Personally, I'm glad CF
         | is pushing back on this naive mentality.
        
       | znpy wrote:
       | At work I'm considering blocking all the ip prefixes announced by
       | ASNs owned by Microsoft and other companies known for their LLMs.
       | At this point it seems like the only viable solution.
       | 
       | LLM scraper bots are starting to make up a lot of our egress
       | traffic, and that is starting to weigh on our bills.
        
       | chuckreynolds wrote:
       | insert 'shocked' emoji face here
        
       | bilater wrote:
       | As others have mentioned, the problem is one of scale. Perhaps
       | there needs to be a rate limit set within robots.txt: a bot may
       | visit the site, but only X times per hour. At least then we
       | move from a binary scrape-or-no-scrape decision to a spectrum.
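       | Something along these lines half-exists already: the non-
       | standard Crawl-delay directive, which some crawlers honor
       | (Google notably ignores it). A sketch, with the delay doing
       | duty as a rate limit:

```text
User-agent: PerplexityBot
Crawl-delay: 36

User-agent: *
Crawl-delay: 10
```

A 36-second delay works out to roughly 100 requests per hour; whether
a given bot honors it is, of course, exactly the problem this article
is about.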
        
       | willguest wrote:
       | > The Internet as we have known it for the past three decades is
       | rapidly changing, but one thing remains constant: it is built on
       | trust.
       | 
       | I think we've been using different internets. The one I use
       | doesn't seem to be built on trust at all. It seems to be
       | constantly syphoning data from my machine to feed the data
       | vampires who are, apparently, adding to (I assume, blood-
       | soaked) cookies
        
         | jgrall wrote:
         | Ain't that the truth.
        
       | seydor wrote:
       | > it is built on trust.
       | 
       | This is funny coming from Cloudflare, the company that blocks
       | most of the internet from being fetched with antispam checks even
       | for a single web request. The internet we knew was open and
       | untrusted, but thanks to companies like Cloudflare, now even
       | the most benign, well-meaning attempt to GET a website is met
       | with a brick wall. The bots of Big Tech, namely Google, Meta
       | and Apple, are of course exempt from this by pretty much every
       | website and by Cloudflare. But try being anyone other than
       | them: no luck. Cloudflare is the biggest enabler of this
       | monopolistic behavior.
       | 
       | That said, why does perplexity even need to crawl websites? I
       | thought they used 3rd party LLMs. And those LLMs didn't ask
       | anyone's permission to crawl the entire 'net.
       | 
       | Also the "Perplexity bots" aren't crawling websites; they fetch
       | URLs that the users explicitly asked for. This shouldn't count
       | as something that needs robots.txt access. It's not a robot
       | randomly crawling, it's the user asking for a specific page:
       | basically a shortcut for copy/pasting the content.
        
         | pphysch wrote:
         | Spam and DDOS are serious problems, it's not fair to suggest
         | Cloudflare is just doing this to gatekeep the Internet for its
         | own sake.
        
           | seydor wrote:
           | It's definitely not a DDOS when it's a single http request
           | per year. I don't know if they do it on purpose but the fact
           | is none of the big tech crawlers are limited.
        
             | zaphar wrote:
             | This is most attributable to the fact that traffic is
              | essentially anonymous, so the source IP address is the best
             | that a service can do if it's trying to protect an
             | endpoint.
        
           | ok123456 wrote:
           | OVH does a good job with DDoS.
        
         | rat9988 wrote:
         | Don't they need a search index?
        
         | jklinger410 wrote:
         | > That said, why does perplexity even need to crawl websites?
         | 
         | So you just came here to bitch about Cloudflare? It's wild to
         | even comment on this thread if this does not make sense to you.
         | 
         | They're building a search index. Every AI is going to struggle
         | at being a tool to find websites & business listings without a
         | search index.
        
         | Taek wrote:
         | We're moving progressively in the direction of "pages can't be
         | served for free anymore". Which, I don't think is a problem,
         | and in fact I think it's something we should have addressed a
         | long time ago.
         | 
         | Cloudflare only needs to exist because the server doesn't get
         | paid when a user or bot requests resources. Advertising only
         | needs to exist because the publisher doesn't get paid when a
         | user or bot requests resources.
         | 
         | And the thing is... people already pay for internet. They pay
         | their ISP. So people are perfectly happy to pay for resources
         | that they consume on the Internet, and they already have an
         | infrastructure for doing so.
         | 
         | I feel like the answer is that all web requests should come
         | with a price tag, and the ISP that is delivering the data is
         | responsible for paying that price tag and then charging the
         | downstream user.
         | 
         | It's also easy to ratelimit. The ISP will just count the price
         | tag as 'bytes'. So your price could be 100 MB or whatever
         | (independent of how large the response is), and if your
         | internet is 100 Mbps, the ISP will stall out the request for 8
         | seconds, and then make it. If the user aborts the request
         | before the page loads, the ISP won't send the request to the
         | server and no resources are consumed.
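         | The stall arithmetic from the example, for concreteness:

```python
price_mb = 100    # the page's declared price, expressed as megabytes
link_mbps = 100   # the user's connection speed, in megabits per second

# 100 MB = 800 megabits; at 100 Mbps the ISP stalls the request for 8 s
stall_seconds = price_mb * 8 / link_mbps
print(stall_seconds)  # 8.0
```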
        
           | BolexNOLA wrote:
           | My first reaction: This solution would basically kill what
           | little remaining fun there is to be had browsing the Internet
           | and all but assure no new sites/smaller players will ever see
           | traffic.
           | 
           | Curious to hear other perspectives here. Maybe I'm over
           | reacting/misunderstanding.
        
             | armchairhacker wrote:
             | Depending on the implementation (a big if) it would help
             | smaller websites, because it would make hosting much
             | cheaper. ISPs don't choose what sites users visit, only
             | what they pay. As long as the ISP isn't giving significant
             | discounts to visiting big sites (just charging a fixed rate
              | per byte downloaded and uploaded) and charging something
             | reasonable, visiting a small site would be so cheap (a few
             | cents at most, but more likely <1 cent) users won't weigh
             | cost at all.
        
               | BolexNOLA wrote:
               | But users depend on major sites like google [insert
               | service] still and will prioritize their usage
               | accordingly like limited minutes and texts back in the
               | day, right?
        
               | armchairhacker wrote:
               | Networking is so cheap, unless ISPs drastically inflate
               | their price, users won't care.
               | 
               | The average American allegedly* downloads
               | 650-700GB/month, or >20GB/day. 10MB is more than enough
               | for a webpage (honestly, 1MB is usually enough), so that
               | means on average, ISPs serve over 2000 webpages worth of
               | data per day. And the average internet plan is
               | allegedly** $73/month, or <$2.50/day. So $2.50 gets you
               | over 2000 indie sites.
               | 
               | That's cheap enough, wrapped in a monthly bill, users
               | won't even pay attention to what sites they visit. The
               | only people hurt by an ideal (granted, _ideal_ )
               | implementation are those who abuse fixed rates and
               | download unreasonable amounts of data, like web crawlers
               | who visit the same page seconds apart for many pages in
               | parallel.
               | 
               | * https://www.astound.com/learn/internet/average-
               | internet-data...
               | 
               | ** https://www.nerdwallet.com/article/finance/how-much-
               | is-inter...
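               | The arithmetic here checks out, taking the midpoint of
               | the cited 650-700 GB/month range:

```python
gb_per_month = 675           # midpoint of the cited 650-700 GB/month
plan_dollars_per_month = 73  # cited average US internet plan
mb_per_page = 10             # generous per-page budget from the comment

pages_per_day = gb_per_month * 1000 / 30 / mb_per_page
dollars_per_day = plan_dollars_per_month / 30

print(round(pages_per_day))       # 2250 pages' worth of data per day
print(round(dollars_per_day, 2))  # 2.43 dollars per day
```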
        
               | brookst wrote:
                | Wait, so the ISPs go from taking $73/user home today to
               | taking $0/user home tomorrow under this plan?
        
               | BolexNOLA wrote:
               | Yeah same reaction here - there's no world in which ISP's
               | would agree to this and even if they did I don't want to
               | add them to my list of utilities I have to regularly
               | fight with over claimed vs. actual usage like I do with
               | my power/water/gas companies.
        
             | Analemma_ wrote:
             | If site operators can't afford the costs of keeping sites
             | up in the face of AI scraping, the new/smaller sites are
             | gone anyway.
        
               | BolexNOLA wrote:
               | Maybe not but we are not realistically in an either/or
               | scenario here.
        
           | dabockster wrote:
           | > We're moving progressively in the direction of "pages can't
           | be served for free anymore". Which, I don't think is a
           | problem, and in fact I think it's something we should have
           | addressed a long time ago.
           | 
           | I agree, but your idea below that is overly complicated. You
           | can't micro-transact the whole internet.
           | 
           | That idea feels like those episodes of Star Trek DS9 that
            | take place on Ferenginar - where you have to pay admission
            | and sign liability waivers to even walk on the sidewalk
            | outside. It's not a true solution.
        
             | vineyardmike wrote:
             | > You can't micro-transact the whole internet.
             | 
             | I agree that end-users cannot handle micro transactions
             | across the whole internet. That said, I would like to point
             | out that most of the internet is blanketed in ads and ads
             | involve tons of tiny quick auctions and micro transactions
             | that occur on each page load.
             | 
             | It is totally possible for a system to evolve involving
             | tons of tiny transactions across page loads.
        
               | edoceo wrote:
               | Remember Flattr?
        
               | helloplanets wrote:
               | You could argue that the suggested system is actually
               | much simpler than the one we currently have for the sites
               | that are "free", aka funded with ads.
               | 
               | The lengths Meta and the like go to in order to maximize
               | clickthroughs...
        
             | sellmesoap wrote:
             | > You can't micro-transact the whole internet.
             | 
             | Clearly you don't have the lobes for business /s
        
             | Taek wrote:
             | The presented solution has invisible UX via layering it
             | into existing metered billing.
             | 
             | And, the whole internet is already micro-transactioned!
             | Every page with ads is doing a bidding war and spending
             | money on your attention. The only person not allowed to bid
             | is yourself!
        
           | seer wrote:
           | Hah still remember the old "solving the internet with hate"
           | idea from Zed Shaw in the glory days of Ruby on Rails.
           | 
           | https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-
           | saving-...
           | 
           | I do believe we will end there eventually, with the emerging
           | tech like Brazil's and India's payment architectures it
           | should be a possibility in the coming decades
        
           | debesyla wrote:
            | Wouldn't this lead to pirated page clones where the
            | customer pays less for same-ish content, and less and
            | less, all the way down to essentially free?
            | 
            | Because I, as a user, would be glad to have a "free sites
            | only" filter, and then just steal content :))
           | 
           | But it's an interesting idea and thought experiment.
        
             | armchairhacker wrote:
             | That's fine. The point for website owners isn't to make
             | money, it's to not spend money hosting (or more
             | specifically, to pay a small fixed rate hosting). They want
             | people to see the content; if someone makes the content
             | more accessible, that's a good thing.
        
               | mapontosevenths wrote:
               | You ignore the issue of motivation. Most web content
               | exists because someone wants to make money on it. If the
               | content creator can't do that, they will stop producing
               | content.
               | 
               | These AI web crawlers (Google, Perplexity, etc) are self-
               | cannibalizing robots. They eat the goose that laid the
               | golden egg for breakfast, and lose money doing it most of
               | the time.
               | 
               | If something isn't done to incentivize content creators
               | again eventually there will be only walled-gardens and
               | obsolete content left for the cannibals.
        
               | armchairhacker wrote:
               | AFAIK, currently creators get money while not charging
               | for users because of ads.
               | 
               | While I don't blame creators for using ads now, I don't
               | think they're a long-term solution. Ads are already
               | blocked when people visit the site with ad blockers,
               | which are becoming more popular. Obvious sponsored
               | content may be blocked with the ads, and non-obvious
               | sponsored content turns these "creators" into "shills"
               | who are inauthentic and untrustworthy. Even without
               | Google summaries, ad revenue may decrease over time as
               | advertisers realize they aren't effective or want more
               | profit; even if it doesn't, it's my personal opinion that
               | society should decrease the overall amount of ads.
               | 
               | Not everyone creates only for money, the best only create
               | for enough money to sustain themselves. A long-term
               | solution is to expand art funding (e.g. creators apply
               | for grants with their ideas and, if accepted, get paid a
               | fixed rate to execute them) or UBI. Then media can be
               | redistributed, remixed, etc. without impacting creators'
               | finances.
        
               | Terretta wrote:
               | Pretty sure this "most" motivation means it's not a
               | golden egg. It's SEO slop.
               | 
               | If only the one in ten thousand with something to share
               | are left standing to share it, no manufactured content,
               | that's a fine thing.
        
               | Terretta wrote:
               | Strongly agree with this armchair POV. Btw it doesn't
               | cost much to host markdown.
        
           | nazcan wrote:
            | I think value is not proportional to bytes - an AI only
            | needs to read a page once to add it to its model, and can
            | then serve the effectively cached data many times.
        
           | novok wrote:
           | The reason why that didn't work was because regulations made
           | micropayments too expensive, and the government wants it that
           | way to keep control over the financial system.
        
           | OptionOfT wrote:
           | > We're moving progressively in the direction of "pages can't
           | be served for free anymore". Which, I don't think is a
           | problem, and in fact I think it's something we should have
           | addressed a long time ago.
           | 
           | But it's done through a bait and switch. They serve the full
           | article to Google, which allows Google to show you excerpts
           | that you have to pay for.
           | 
            | It would be better if Google showed something like PAYMENT
            | REQUIRED on top; at least that way I'd know what I'm
            | getting into.
        
             | mh- wrote:
             | _> They serve the full article to Google, which allows
             | Google to show you excerpts that you have to pay for._
             | 
             | I'm old enough to remember when that was grounds for
             | getting your site removed from Google results - "cloaking"
             | was against the rules. You couldn't return one result for
             | Googlebot, and another for humans.
             | 
             | No idea when they stopped doing that, but they obviously
             | have let go of that principle.
        
               | dspillett wrote:
               | I remember that too, along with high-profile punishments
               | for sites that were keyword stuffing (IIRC a couple of
               | decades ago BMW were completely unlisted for a time for
               | this reason).
               | 
                | I think it died largely because it became impossible to
               | police with any reliability, and being strict about it
               | would remove too much from Google's index because many
               | sites are not easily indexable without them providing a
               | "this is the version without all the extra round-trips
               | for ad impressions and maybe a login needed" variant to
               | common search engines.
               | 
               | Applying the rule strictly would mean that sites
               | implementing PoW tricks like Anubis to reduce unwanted
               | bot traffic would not be included in the index if they
               | serve to Google without the PoW step.
               | 
                | I can't say I like that this has been legitimised,
                | even for the (arguably more common) deliberate bait &
                | switch tricks, but (I think) I understand why the rule
                | was allowed to slide.
        
           | chromatin wrote:
           | 402 Payment Required
           | 
           | https://developer.mozilla.org/en-
           | US/docs/Web/HTTP/Reference/...
           | 
           | Sadly development along these lines has not progressed. Yes,
           | Google Cloud and other services may return it and require
           | some manual human intervention, but I'd love to see
           | _automatic payment negotiation_.
           | 
           | I'm hopeful that instant-settlement options like Bitcoin
           | Lightning payments could progress us past this.
           | 
           | https://docs.lightning.engineering/the-lightning-
           | network/l40...
           | 
           | https://hackernoon.com/the-resurgence-of-http-402-in-the-
           | age...
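           | A sketch of what automatic 402 negotiation could look like
           | server-side. No standard exists yet, so the header names
           | and the settlement check here are invented for
           | illustration:

```python
# Hypothetical handler: quote a price on the first request, serve the
# page once the client retries with a settled payment token.
PRICE_MSAT = 1000  # e.g. ~1 satoshi, settled over a Lightning invoice

SETTLED_TOKENS = {"demo-settled-token"}  # stand-in for node state

def is_paid(token):
    # Stand-in for asking a Lightning node whether the invoice settled.
    return token in SETTLED_TOKENS

def handle(headers):
    """Return (status, response_headers, body) for a priced resource."""
    token = headers.get("X-Payment-Token")  # invented header name
    if token is None or not is_paid(token):
        # Quote the price; a paying client retries with a valid token.
        return 402, {"X-Payment-Price-Msat": str(PRICE_MSAT)}, ""
    return 200, {}, "<html>the actual page</html>"
```

The interesting part is that both ends can be fully automatic: a
crawler's HTTP client could pay the quote and retry without a human
ever seeing a paywall.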
        
           | makingstuffs wrote:
           | As time passes I'm more certain in the belief that the
           | internet will end up being a licensed system with insanely
           | high barriers to entry which will stop your average dev from
           | even being able to afford deploying a hobby project on it.
           | 
           | Your idea of micro transacting web requests would play into
           | it and probably end up with a system like Netflix where your
           | ISP has access to a set of content creators to whom they
           | grant 'unlimited' access as part of the service fee.
           | 
           | I'd imagine that accessing any content creators which are not
           | part of their package will either be blocked via a paywall
           | (buy an addon to access X creators outside our network each
           | month) or charged at an insane price per MB as is the case
           | with mobile data.
           | 
            | Obviously this is all super hypothetical, but weirder
            | stuff has happened in my lifetime.
        
           | AlexandrB wrote:
           | A scary observation in light of another front page article
           | right now: https://news.ycombinator.com/item?id=44783566
           | 
           | If pages can't be served for free, all internet content is at
           | the mercy of payment processors and their ideas of "brand
           | safety".
        
             | dspillett wrote:
             | "Free" could have a number of meanings here. Free to the
             | viewer, free to the hoster, free to the creator, etc...
             | 
             | That content can't be served entirely for free doesn't mean
             | that _all_ content will require payment, and so is subject
             | to issues with payment processors, just that some things
             | may gravitate back to a model where it costs a small amount
             | to host something (i.e. pay for home internet and host bits
             | off that, or you might have VPS out there that runs tools
              | and costs a few $/yr or /month). I pay for resources to
              | host my bits & bobs instead of relying on services
              | provided in exchange for stalking the people looking at
              | them; this is free for the viewer, as they aren't even
              | paying indirectly.
             | 
              | Most things are paid for anyway, even if neither the
              | person hosting it nor the person looking at it is
              | paying directly: adtech
             | arseholes give services to people hosting content in
             | exchange for the ability to stalk us and attempt to divert
             | our attention. Very few sites/apps, other than play/hobby
             | ones like mine or those from more actively privacy focused
             | types, are free of that.
        
             | Taek wrote:
             | That's already a deep problem for all of society. If we
             | don't want that to be an ongoing issue, we need to make
             | sure money is a neutral infrastructure.
             | 
             | It doesn't just apply to the web, it applies to literally
             | everything that we spend money on via a third party
             | service. Which is... most everything these days.
        
           | bboygravity wrote:
           | I get your thinking, but x.com is proof that simply making
           | users pay (quite a lot) does not eliminate bots.
           | 
           | The amount of "verified" paying "users" with a blue checkmark
           | that are just total LLM bots is incredible on there.
           | 
           | As long as spamming and DDOS'ing pays more than whatever the
           | request costs, it will keep existing.
        
           | Saline9515 wrote:
           | Why would I pay for a page if I don't know if the content is
           | what I asked for? How much are you going to pay? How much are
           | you going to charge? This will end up in SEO hell, especially
           | with AI-generated pages farming paid clicks.
        
           | Terretta wrote:
           | Or, flip this, don't expect to get paid for pamphleteering?
        
           | adrian_b wrote:
           | Your theory does not match the practice of Cloudflare.
           | 
           | Whatever method is used by Cloudflare for detecting "threats"
           | has nothing to do with consuming resources on the "protected"
           | servers.
           | 
            | The so-called "threats" are identified among users who may make
           | a few accesses per day to a site, transferring perhaps a few
           | kilobytes of useful data on the viewed pages (besides
           | whatever amount of stupid scripts the site designer has
           | implemented).
           | 
           | So certainly Cloudflare does not meter the consumed
           | resources.
           | 
           | Moreover, Cloudflare preemptively annoys any user who
           | accesses for the first time a site, having never consumed any
           | resources, perhaps based on irrational profiling based on the
           | used browser and operating system, and geographical location.
        
         | zer00eyz wrote:
         | > The internet we knew was open and not trusted ...
         | monopolistic behavior
         | 
         | Monopolistic is the wrong word, because you have the problem
          | backwards. Cloudflare isn't helping Apple/Google... It's helping
         | its paying consumers and those are the only services those
         | consumers want to let through.
         | 
         | Do you know how I can predict that AI agents, the sort that end
         | users use to accomplish real tasks, will never take off?
         | Because the people your agent would interact with want your
         | EYEBALLS for ads, build anti patterns on purpose, want to make
         | it hard to unsubscribe, cancel, get a refund, do a return.
         | 
          | AI that is useful to people will fail, for the same reason
          | that no one has great public APIs any more: every public
          | company's real customers are its stockholders, and the
          | consumers are simply a source of revenue, one that is
          | modeled, marketed to, and manipulated, all in the name of
          | returns on investment.
        
           | Zak wrote:
           | I disagree about AI agents, at least those that work by
           | automating a web browser that a human could also use. I
           | suppose Google's proposal to add remote attestation to Chrome
           | might make it a little harder, but that seems to be dead for
           | now (and I hope forever).
        
           | seydor wrote:
           | As agents become more useful, the monetization model will
            | shift to something ... that we haven't thought of yet.
        
         | eddythompson80 wrote:
         | > The bots of Big Tech, namely Google, Meta and Apple are of
         | course exempt from this by pretty much every website and by
         | cloudflare. But try being anyone other than them , no luck.
         | Cloudflare is the biggest enabler of this monopolistic behavior
         | 
         | Plenty of site/service owners explicitly want Google, Meta and
         | Apple bots (because they believe they have a symbiotic
         | relationship with it) and don't want _your_ bot because they
         | view you as, most likely, parasitic.
        
           | seydor wrote:
            | they didn't seem to mind when OpenAI et al. took all their
           | content to train LLMs when they were still parasites that
           | didn't have a symbiotic relationship. This thinking is kind
           | of too pro-monopolist for me
        
             | eddythompson80 wrote:
             | Pretty sure they DID mind that. It's what the whole post is
             | about.
        
             | golergka wrote:
             | That's a good thing. You want an LLM to know about product
             | or service you are selling and promote it to its users.
             | Getting into the training data is the new SEO.
        
         | TZubiri wrote:
         | Websites and any business really, have the right to impose
         | terms of use and deny service.
         | 
          | Anyone circumventing bans is doing something shitty and
          | illegal; see the Computer Fraud and Abuse Act and Craigslist
          | v. 3Taps.
         | 
         | "And those LLMs didn't ask anyones permission to crawl the
         | entire 'net."
         | 
          | False: OpenAI respects robots.txt, doesn't mask IPs, and
          | paid a bunch of $ to Reddit.
         | 
         | You either side with the law or with criminals.
        
           | seydor wrote:
            | Is that also how, e.g., Anthropic trained on LibGen?
           | 
            | You can't even say the same thing about OpenAI because we
           | don't know the corpus they train their models on.
        
         | binarymax wrote:
         | Here's how perplexity works:
         | 
         | 1) It takes your query, and given the complexity might expand
         | it to several search queries using an LLM. ("rephrasing")
         | 
         | 2) It runs queries against a web search index (I think it was
         | using Bing or Brave at first, but they probably have their own
         | by now), and uses an LLM to decide which are the best/most
         | relevant documents. It starts writing a summary while it dives
         | into sources (see next).
         | 
         | 3) If necessary it will download full source documents that
         | popped up in search to seed the context when generating a more
         | in-depth summary/answer. They do this themselves because using
         | OpenAI to do it is far more expensive.
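
The three steps above can be sketched as follows (a toy illustration with stub functions; the names `rephrase`, `search`, and `fetch` are my inventions, not Perplexity's actual internals):

```python
# Illustrative sketch of the three-step flow described above.
# All names and behavior are guesses; none of this is Perplexity's code.

def rephrase(query):
    """Step 1: an LLM would expand the query; here, a trivial stand-in."""
    return [query, f"{query} review", f"{query} comparison"]

def search(subquery):
    """Step 2: query a web search index (Bing/Brave/their own); stubbed here."""
    return [f"https://example.com/result-for/{subquery.replace(' ', '-')}"]

def fetch(url):
    """Step 3: download the full document to seed the LLM's context.
    This is the contentious part: a real version would do an HTTP GET
    against the publisher's site."""
    return f"<html>contents of {url}</html>"

def answer(query):
    docs = []
    for sub in rephrase(query):
        for url in search(sub):
            docs.append(fetch(url))
    # An LLM would then rank the docs and write a summary from them.
    return f"summary built from {len(docs)} fetched documents"

print(answer("best laptop"))  # prints: summary built from 3 fetched documents
```

The point of the sketch is that step 3 multiplies one user question into several publisher-side fetches, which is why sites notice the traffic.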
         | 
         | #3 is the problem. Especially because SEO has really made it so
         | the same sites pop up on top for certain classes of queries.
          | (for example Reddit will be on top for product reviews a lot).
         | These sites operate on ad revenue so their incentive is to
         | block. Perplexity does whatever they can in the game of
         | sidestepping the sites' wishes. They are a bad actor.
         | 
         | EDIT: I should also add that Google, Bing, and others, always
         | obey robots.txt and they are good netizens. They have enough
         | scale and maturity to patiently crawl a site. I wholeheartedly
         | agree that if an independent site is also a good netizen, they
         | should not be blocked. If Perplexity is not obeying robots.txt
         | and they are impatient, they should absolutely be blocked.
        
           | pests wrote:
           | What's wrong with it downloading documents when the user asks
           | it to? My browser also downloads whole documents and
           | sometimes even prefetches documents I haven't even clicked on
           | yet. Toss in a adblocker or reader mode and my browser also
           | strips all the ads.
           | 
           | Why is it okay for me to ask my browser to do this but I
           | can't ask my LLM to do the same?
        
             | binarymax wrote:
             | There's nothing wrong with downloading documents. I do this
             | in my personal search app. But if you are hammering the
             | site that wants you to calm down, or bypass robots.txt,
             | that's wrong.
        
               | pests wrote:
               | robots.txt is for bots and I am not one though. As a user
               | I can access anything regardless of it being blocked to
               | bots. There are other mechanisms like status codes to
               | rate limit or authenticate if that is an issue.
        
               | binarymax wrote:
               | I'm talking about perplexity's behavior. Perhaps there's
               | a point of contention on perplexity downloading a
               | document on a person's behalf. I view this as if there is
               | a service running that does it for multiple people, then
               | it's a bot.
        
               | layer8 wrote:
               | Perplexity makes requests on behalf of its users. I would
               | argue that's only illegitimate if the combined volume of
               | the requests exceeds what the users would do by an order
               | of magnitude or two. Maybe that's what's happening.
               | 
               | But "for multiple people" isn't an argument IMO, since
               | each of those people could run a separate service doing
               | the same. Using the same service, on the contrary,
               | provides an opportunity to reduce the request volume by
               | caching.
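
The caching point can be made concrete with a minimal TTL-cache sketch (purely illustrative, not any real service's code): many users asking about the same page can cost the origin a single fetch.

```python
import time

class CachedFetcher:
    """Serve repeated requests for the same URL from cache for `ttl`
    seconds, so N users asking about one page cost the origin one fetch."""

    def __init__(self, fetch, ttl=300):
        self.fetch = fetch        # function performing the actual HTTP GET
        self.ttl = ttl
        self.cache = {}           # url -> (timestamp, body)
        self.origin_hits = 0      # how many times the origin was touched

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(url)
        if entry and now - entry[0] < self.ttl:
            return entry[1]       # cache hit: origin not touched
        self.origin_hits += 1
        body = self.fetch(url)
        self.cache[url] = (now, body)
        return body

# Ten users asking about the same page produce one origin request:
f = CachedFetcher(lambda url: f"body of {url}")
for _ in range(10):
    f.get("https://example.com/review")
print(f.origin_hits)  # prints: 1
```

Whether shared services actually cache this aggressively is another question, but the opportunity only exists when requests are funneled through one service.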
        
             | michaelt wrote:
             | When Google sends people to a review website, 30% of users
             | might have an adblocker, but 70% don't. And even those with
             | adblockers _might_ click an affiliate link if they found
             | the review particularly helpful.
             | 
             | When ChatGPT reads a review website, though? Zero ad
             | clicks, zero affiliate links.
        
               | pests wrote:
               | So if enough people used adblockers that would make them
               | bad too? It's just an issue of numbers?
               | 
               | Brave blocks ads by default. Tools like Pocket and reader
               | mode disables ads.
               | 
               | Why is it okay for some user agents but not others?
        
         | raincole wrote:
         | I'm sorry, but that's some crazy take.
         | 
         | Sure, the internet should be open and not trusted. But physical
         | reality exists. Hosting and bandwidth cost money. I trust
          | Google won't DDoS my site or cost me an arbitrary amount of
         | money. I won't trust bots made by random people on the internet
         | in the same way. The fact that Google respects robots.txt while
         | Perplexity doesn't tells you why people trust Google more than
         | random bots.
        
           | seydor wrote:
            | Agree to disagree, but:
           | 
           | Google already has access to any webpage because its own
           | search Crawlers are allowed by most websites, and google
           | crawls recursively. Thus Gemini has an advantage of this
           | synergy with google search. Perplexity does not crawl
            | recursively (I presume -- therefore it does not need to
           | consult robots.txt), and it doesn't have synergies with a
           | major search engine.
        
         | benregenspan wrote:
         | > The bots of Big Tech, namely Google, Meta and Apple are of
         | course exempt from this by pretty much every website and by
         | cloudflare. But try being anyone other than them , no luck.
         | Cloudflare is the biggest enabler of this monopolistic behavior
         | 
         | The Big Tech bots provide proven value to most sites. They have
         | also through the years proven themselves to respect robots.txt,
         | including crawl speed directives.
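
For reference, those directives look something like this in a site's robots.txt (Crawl-delay is a nonstandard extension that some crawlers honor and Google ignores; PerplexityBot is Perplexity's declared user agent, which the article alleges its stealth crawlers sidestep):

```
# Block a specific declared AI crawler entirely
User-agent: PerplexityBot
Disallow: /

# Ask everyone else to pace themselves
User-agent: *
Crawl-delay: 10
Disallow: /private/
```

The whole mechanism is honor-system: it only constrains crawlers that identify themselves and choose to comply.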
         | 
         | If you manage a site with millions of pages, and over the
         | course of a couple years you see tens of new crawlers start to
         | request at the same volume as Google, and some of them crawl at
         | a rate high enough (and without any ramp-up period) to degrade
         | services and wake up your on-call engineers, and you can't
         | identify a benefit to you from the crawlers, what are you going
         | to do? Are you going to pay a lot more to stop scaling down
         | your cluster during off-peak traffic, or are you going to start
         | blocking bots?
         | 
         | Cloudflare happens to be the largest provider of anti-DDoS and
         | bot protection services, but if it wasn't them, it'd be someone
         | else. I miss the open web, but I understand why site operators
         | don't want to waste bandwidth and compute on high-volume bots
         | that do not present a good value proposition to them.
         | 
         | Yes this does make it much harder for non-incumbents, and I
         | don't know what to do about that.
        
           | seydor wrote:
            | it's because those SEO bots keep crawling over and over,
            | which Perplexity does not seem to do (considering that the
            | URLs are user-requested). Those are different cases and
           | robots.txt is only about the former. Cloudflare in this case
           | is not doing "ddos protection" because i presume Perplexity
           | does not constantly refetch or crawl or ddos the website (If
           | perplexity does those things then they are guilty)
           | 
           | https://www.robotstxt.org/faq/what.html
           | 
           | I wonder if cloudflare users explicitly have to allow google
           | or if it's pre-allowed for them when setting up cloudflare.
           | 
           | Despite what Cloudflare wants us to think here, the web was
           | always meant to be an open information network , and spam
           | protection should not fundamentally change that
           | characteristic.
        
             | benregenspan wrote:
             | I believe that AI crawlers are the main thing that is
             | currently blocked by default when you enroll a new site. No
             | traditional crawlers are blocked, it's not that the big
             | incumbents are allow-listed. And I think that clearly
             | marked "user request" agents like ChatGPT-User are not
             | blocked by default.
             | 
             | But at end of day it's up to the site operator, and any
             | server or reverse proxy provides an easy way to block well-
             | behaved bots that use a consistent user-agent.
        
           | akagusu wrote:
           | > The Big Tech bots provide proven value to most sites.
           | 
            | They provide value for their companies. If you get some value
           | from them it's just a side effect.
        
             | benregenspan wrote:
             | It goes without saying that they are profit-oriented. The
             | point is that they historically offered a clear trade: let
             | us crawl you, and we will refer traffic to you. An AI
             | crawler does not provide clear value back. An AI user
             | request agent might or might not provide enough clear value
             | back for sites to want to participate. (Same goes for the
             | search incumbents if they go all-in on LLM search results
             | and don't refer much traffic back).
        
         | busymom0 wrote:
         | > why does perplexity even need to crawl websites?
         | 
         | I was recently working on a project where I needed to find out
         | the published date for a lot of article links and this came
         | helpful. Not sure if it's changed recently but asking ChatGPT,
         | Gemini etc didn't work and it said that it doesn't have access
         | to the current websites. However, asking perplexity, it fetched
         | the website in real time and gave me the info I needed.
         | 
         | I do agree with the rest of your comment that this is not a
         | random robot crawling. It was doing what a real user (me) asked
         | it to fetch.
        
         | andy99 wrote:
         | Can't agree more, cloudflare is destroying the internet. We've
          | entered the equivalent of when having McAfee antivirus was
          | worse than having an actual virus because it slowed down your
          | computer too much. These user-hostile solutions have taken us
         | back to dialup era page loading speeds for many sites, it's
         | absurd that anyone thinks this is a service worth paying for.
        
           | rstat1 wrote:
           | So server owners are just supposed to bend over and take all
           | the abuse they get from shitty bots and DDOS attacks and do
           | nothing?
           | 
           | That seems pretty unreasonable.
        
             | spwa4 wrote:
             | No they're supposed to allow scraping and information
             | aggregation. That's the essence of the web: it's all text,
             | crawlable, machine-readable (sort of) and parseable. Feel
             | free to block ddos'es.
        
               | bayindirh wrote:
               | Feel free to crawl paywalled sites and republish them
               | with discoverable links.
               | 
               | Also after starting the crawl, you can read about Aaron
               | Swartz while waiting.
        
             | inetknght wrote:
             | No, they're supposed to rally together and fight for better
             | laws and enforcement of those laws. Which is, arguably,
             | exactly what they've done just in a way that you and I
             | don't like.
        
               | armchairhacker wrote:
               | What kind of laws and enforcement would stop a foreign
               | actor from effectively DDoSing your site? What if the
               | actor has (illegally) hacked tech-illiterate users so
               | they have domestic residential IP addresses?
        
               | inetknght wrote:
               | > _What kind of laws and enforcement would stop a foreign
               | actor from effectively DDoSing your site?_
               | 
               | The kind of laws and enforcement that would block that
               | entire country from the internet if it doesn't get its
               | criminal act together.
        
             | madrox wrote:
             | There is a difference between blocking abusive behavior and
             | blocking all bots. No one really cared about bot scraping
             | to this degree before AI scraping for training purposes
             | became a concern. This is fearmongering by Cloudflare for
             | website maintainers who haven't figured out how to adapt to
             | the AI era so they'll buy more Cloudflare.
        
               | remus wrote:
               | > No one really cared about bot scraping to this degree
               | before AI scraping for training purposes became a
               | concern. This is fearmongering by Cloudflare for website
               | maintainers who haven't figured out how to adapt to the
               | AI era so they'll buy more Cloudflare.
               | 
               | I think this is an overly harsh take. I run a fairly
               | niche website which collates some info which isn't
               | available anywhere else on the internet. As it happens I
               | don't mind companies scraping the content, but I could
                | totally understand if someone didn't want a company
               | profiting from their work in that way. No one is under an
               | obligation to provide a free service to AI companies.
        
             | adrian_b wrote:
             | Unreasonable is to use such incompetent companies like
             | Cloudflare, which are absolutely incapable of
             | distinguishing between the normal usage of a Web site by
             | humans and DDOS attacks or accesses done by bots.
             | 
             | Only this week I have witnessed several dozen cases when
             | Cloudflare has blocked normal Web page accesses without any
             | possible correct reason, and this besides the normal
             | annoyance of slowing every single access to any page on
             | their "protected" sites with a bot check popup window.
        
               | rstat1 wrote:
               | I don't know seems like it was working as intended to me.
        
           | CharlesW wrote:
           | Ethics-free organizations and individuals like Perplexity are
           | _why Cloudflare exists_. If you have a better way to solve
           | the problems that they solve, the marketplace would reward
           | you handsomely.
        
             | Terretta wrote:
             | Do you think users shouldn't get to have user agents or
             | that "content farm ads scaffold" as a business model has a
             | right to be viable? Forcing users to reward either stance
             | seems unsustainable.
        
               | CharlesW wrote:
                | > _Do you think users shouldn't get to have user agents
               | or that "content farm ads scaffold" as a business model
               | has a right to be viable?_
               | 
               | Users should get to have authenticated, anonymous proxy
               | user agents. Because companies like Perplexity just
               | ignore `robots.txt`, maybe something like Private Access
               | Tokens (PATs) with a new class for autonomous agents
               | could be a solution for this.
               | 
               | By "content farm ads scaffold", I'm not sure if you had
               | Perplexity and their ads business in mind, or those
               | crappy little single-serving garbage sites. In any case,
               | they shouldn't be treated differently. I have no problem
               | with the business model, other than that the scam only
               | works because it's currently trivial to parasitically
               | strip-mine and monetize other people's IP.
        
             | adrian_b wrote:
             | While the existence of Perplexity may justify the existence
             | of Cloudflare, it does not justify the incompetence of
             | Cloudflare, which is unable to distinguish accesses done by
             | Perplexity and the like from normal accesses done by
             | humans, who use those sites exactly for the purpose they
             | exist, so there cannot be any excuse for the failure of
             | Cloudflare to recognize this.
        
           | bob1029 wrote:
            | > when having McAfee antivirus was worse than having an
            | actual virus because it slowed down your computer too much
           | 
           | This exact same thing continues in 2025 with Windows
           | Defender. The cheaper Windows Server VMs in the various cloud
           | providers are practically unusable until you disable it.
           | 
           | You can tell this stuff is no longer about protecting users
           | or property when there are no meaningful workarounds or
           | exceptions offered anymore. You _must_ use defender (or
           | Cloudflare) unless you intend to be a naughty pirate user.
           | 
           | I think half of this stuff is simply an elaborate power trip.
           | Human egos are fairly predictable machines in aggregate.
        
           | adrian_b wrote:
           | In the previous years, I did not have many problems with
           | Cloudflare.
           | 
           | However, in the last few months, Cloudflare has become
           | increasingly annoying. I suspect that they might have
           | implemented some "AI" "threat" detection, which gives much
           | more false positives than before.
           | 
           | For instance, this week I have frequently been blocked when
           | trying to access the home page of some sites where I am a
           | paid subscriber, with a completely cryptic message "The
           | action you just performed triggered the security solution.
           | There are several actions that could trigger this block
           | including submitting a certain word or phrase, a SQL command
           | or malformed data.".
           | 
           | The only "action" that I have done was opening the home page
           | of the site, where I would then normally login with my
           | credentials.
           | 
           | Also, during the last few days I have been blocked from
           | accessing ResearchGate. I may happen to hit a few times per
           | day some page on the ResearchGate site, while searching for
           | various research papers, which is the very purpose of that
           | site. Therefore I cannot understand what stupid algorithm is
           | used by Cloudflare, that it declares that such normal usage
           | is a "threat".
           | 
           | The weird part is that this blocking happens only if I use
           | Firefox (Linux version). With another browser, i.e. Vivaldi
           | or Chrome, I am not blocked.
           | 
           | I have no idea whether Cloudflare specifically associates
           | Firefox on Linux with "threats" or this happens because
           | whatever flawed statistics Cloudflare has collected about my
           | accesses have all recorded the use of Firefox.
           | 
           | In any case, Cloudflare is completely incapable of
           | discriminating between normal usage of a site by a human
           | (which may be a paying customer) and "threats" caused by bots
           | or whatever "threatening" entities might exist according to
           | Cloudflare.
           | 
           | I am really annoyed by the incompetent programmers who
           | implement such dumb "threat detection solutions", which can
           | create major inconveniences for countless people around the
           | world, while the incompetents who are the cause of this are
           | hiding behind their employer corporation and never suffer
           | consequences proportional to the problems that they have
           | caused to others.
        
         | concinds wrote:
         | > The internet we knew was open and not trusted , but thanks to
         | companies like Cloudflare, now even the most benign , well
         | meaning attempt to GET a website is met with a brick wall
         | 
         | I don't think it's fair to blame Cloudflare for that. That's
         | looking at a pool of blood and not what caused it: the
         | bots/traffic which predate LLMs. And Cloudflare is working to
         | fix it with the PrivacyPass standard (which Apple joined).
         | 
         | Each website is freely opting-into it. No one was forced. Why
         | not ask yourself why that is?
        
           | seydor wrote:
           | do you think that every well-meaning GET request should be
           | treated the same way as a distributed attack ? The latter is
           | the reason why people use CF not the former.
        
             | concinds wrote:
              | The line can be _extremely blurry_ (that's putting it
             | mildly), and "the latter" is not the only reason people use
             | CF (actually, I wouldn't be surprised at all if it wasn't
             | even the biggest reason).
        
               | akagusu wrote:
                | The reason people use Cloudflare is that they provide
                | a free CDN, and we have at least 10 years of content
                | marketing out there telling aspiring bloggers that, if
                | they put a CDN in front of their website, their shitty
                | WordPress site hosted on shady shared hosting will
                | become fast.
        
               | tonyhart7 wrote:
               | well they aren't wrong
        
             | renrutal wrote:
             | How does one tell a "well-meaning" request from an attack?
        
               | sellmesoap wrote:
               | By the volume, distribution, and parameters (get and post
               | body) of the requests.
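
One crude version of the volume signal is a sliding-window counter per client. This is a toy heuristic for illustration only, nothing like what Cloudflare actually runs; real systems also weigh IP distribution, headers, and request parameters, as the comment above notes:

```python
from collections import deque

class SlidingWindowLimiter:
    """Flag a client as suspicious if it exceeds `limit` requests
    in any `window`-second span. A toy heuristic for illustration."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # client_id -> deque of request timestamps

    def allow(self, client_id, now):
        q = self.hits.setdefault(client_id, deque())
        while q and now - q[0] > self.window:
            q.popleft()             # drop timestamps outside the window
        q.append(now)
        return len(q) <= self.limit

lim = SlidingWindowLimiter(limit=5, window=60.0)
human = all(lim.allow("human", t) for t in range(0, 300, 60))  # ~1 req/min
bot = all(lim.allow("bot", t / 10) for t in range(100))        # ~10 req/sec
print(human, bot)  # prints: True False
```

The blurry-line problem from the sibling comment shows up immediately: a fetch-on-behalf-of-users service looks like the "bot" trace even when every request maps to a real human.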
        
         | pkilgore wrote:
         | > This is funny coming from Cloudflare, the company that blocks
         | most of the internet from being fetched with antispam checks
         | even for a single web request.
         | 
         | Am I misunderstanding something. I (the site owner) pay
         | Cloudflare to do this. It is my fault this happens, not
         | Cloudflare's.
        
           | layer8 wrote:
           | You're paying Cloudflare to not get DDoS-attacked or swamped
           | by illegitimate requests. GP is implying that Cloudflare
           | could do a better job of not blocking legitimate, benign
           | requests.
        
             | pkilgore wrote:
             | Then we're all operating with very different definitions of
             | legitimate or benign!
             | 
             | I've only ever seen a Cloudflare interstitial when viewing
             | a page with my VPN on, for example -- something I'm happy
             | about as a site owner and accept quite willingly as a VPN
             | user knowing the kinds of abuse that occur over VPN.
        
         | kentonv wrote:
         | > the "perplexity bots" arent crawling websites, they fetch
         | URLs that the users explicitly asked. This shouldnt count as
         | something that needs robots.txt access. It's not a robot
         | randomly crawling, it's the user asking for a specific page and
         | basically a shortcut for copy/pasting the content
         | 
         | You say "shouldn't" here, but why?
         | 
         | There seems to be a fundamental conflict between two groups who
         | each assert they have "rights":
         | 
         | * Content consumers claim the right to use whatever software
         | they want to consume content.
         | 
         | * Content creators claim the right to control how their content
         | is consumed (usually so that they can monetize it).
         | 
         | These two "rights" are in direct conflict.
         | 
         | The bias here on HN, at least in this thread, is clearly
         | towards the first "right". And I tend to come down on this side
         | myself, as a computer power user. I hate that I cannot, for
         | example, customize the software I use to stream movies from
         | popular streaming services.
         | 
         | But on the other hand, content costs money to make. Creators
         | need to eat. If the content creators cannot monetize their
         | content, then a lot of that content will stop being made. Then
         | what? That doesn't seem good for anyone, right?
         | 
         | Whether or not you think they have the "right", Perplexity
         | totally breaks web content monetization. What should we do
         | about that?
         | 
         | (Disclosure: I work for Cloudflare but not on anything related
         | to this. I am speaking for myself, not Cloudflare.)
        
           | kiratp wrote:
           | The web browsers that the AI companies are about to ship will
           | make requests that are indistinguishable from user requests.
           | The ship has sailed on trying to save monetization.
        
             | kentonv wrote:
             | We will be able to distinguish them.
        
           | Terretta wrote:
           | "Creators" need to eat, OK, but there's no right to get paid
           | to paste yesterday's recycled newspapers on my laptop screen.
           | Making that unprofitable seems incredibly good for by and
           | large everyone.
           | 
           | It'd likely be a _fantastic_ good if "content creators"
           | stopped being able to eat from the slop they shovel. In the
           | meantime, the smarter the tools that let folks never
           | encounter that form of "content", the more they will pay for
           | them.
           | 
           | There remain legitimate information creation or information
           | discovery activities that nobody used to call "content". One
           | can tell which they are by whether they have names that pre-
           | date SEO, like "research" or "journalism" or "creative
           | writing".
           | 
           | Ad-scaffolding, what the word "content" came to mean, costs
           | money to make, ideally less than the ads it provides a place
           | for generate. This simple equation means the whole ecosystem,
           | together with the technology attempting to perpetuate it as
           | viable, is an ouroboros, eating its own effluvia.
           | 
           | It is, I would argue, undetermined that advertising-driven
           | content as a business model has a "right" to exist in today's
           | form, rather than any number of other business models that
           | sufficed for millennia of information and artistry before.
           | 
           | Today LLMs serve both the generation of additional literally
           | brain-less content, and the sifting of such from information
           | worth using. Both sides are up in arms, but in the long run,
           | it sure seems some other form of information origination and
           | creativity is likely to serve everyone better.
        
         | mastodon_acc wrote:
         | As a website owner I definitely want the capability to allow
         | and block certain crawlers. If I say I don't want crawlers
         | from Perplexity, they should respect that. This sneaky evasion
         | just highlights that the company is not to be trusted, and I
         | would definitely pay any hosting provider that helps me
         | enforce blocking parasitic companies like Perplexity.
        
         | cantaccesrssbit wrote:
         | I crawl 3000 RSS feeds once a week. Let me tell you, Cloudflare
         | sucks. What business is it of theirs to block something that
         | is meant to be accessed by everyone, like RSS feeds? FU
         | Cloudflare.
        
           | KomoD wrote:
           | That's not Cloudflare's fault, that's the website owner's
           | fault.
           | 
           | If they want the RSS feeds to be accessible then they should
           | configure it to allow those requests.
        
         | blantonl wrote:
         | Ask yourself why so many content hosting platforms utilize
         | Cloudflare's services and then contrast that perspective with
         | your posted one. Might enlighten you a bit to think about that
         | for a second.
        
         | bwb wrote:
         | I could not keep my website up without Cloudflare given the
         | level of bot and AI crawlers hammering things. I try whenever
         | possible to use challenges, but sometimes I have to block
         | entire AS blocks.
        
         | golergka wrote:
         | Ironically, cloudflare is also the reason OpenAI agent mode
         | with web use isn't very usable right now. Every second time I
         | asked it to do a mundane task like checking me in for a flight
         | it couldn't because of cloudflare.
        
           | tonyhart7 wrote:
           | What's ironic about this?
           | 
           | We've seen many posts about site owners getting hit by
           | millions of requests because of LLMs; we can't blame
           | Cloudflare for this because it's literally a necessary evil.
        
       | daft_pink wrote:
       | I'm just curious at what point ai is a crawler and at what point
       | ai is a client when the user is directing the searches and the ai
       | is executing them.
       | 
       | Perplexity Comet sort of blurs the lines there as does typing
       | questions into Claude.
        
       | mikewarot wrote:
       | So, this calls for a new type of honeytrap, content that appears
       | to be human generated, and high quality, but subtly wrong,
       | preferably in a commercially catastrophic way, behind settings
       | that prohibit commercial usage.
       | 
       | It really shouldn't be hard to generate gigantic quantities of
       | the stuff. Simulate old forum posts, or academic papers.
        
         | jgrall wrote:
         | This made me laugh. A form of malicious compliance.
        
         | ascorbic wrote:
         | They did that too https://blog.cloudflare.com/ai-labyrinth/
        
       | djoldman wrote:
       | The cat's out of the bag / pandora's box is opened with respect
       | to AI training data.
       | 
       | No amount of robots.txt or walled-gardening is going to be
       | sufficient to impede generative AI improvement: common crawl and
       | other data dumps are sufficiently large, not to mention easier to
       | acquire and process, that the backlash against AI companies
       | crawling folks' web pages is meaningless.
       | 
       | Cloudflare and other companies are leveraging outrage to acquire
       | more users, which is fine... users want to feel like AI companies
       | aren't going to get their data.
       | 
       | The faster that AI companies are excluded from categories of
       | data, the faster they will shift to categories from which they're
       | not excluded.
        
       | tempfile wrote:
       | A lot of people posting here seem to think you have a magical
       | god-given right to make money from posting on the public
       | internet. You do not. Effective rate-limiting of crawlers is
       | important, but if the rate _is_ moderated, you do not have a
       | right to decide what people do with the content. If you don't
       | believe that, get off the internet, and don't let the door hit
       | you on the way out.
        
         | ibero wrote:
         | what if i want the rate set to zero?
        
           | tempfile wrote:
           | Then turn off the server?
           | 
           | You don't have a right to say who or what can read your
           | public website (this is a normative statement). You do have a
           | right not to be DoS'd. If you pretend not to know what that
           | means, it sounds the same as saying "you have an arbitrary
           | right to decide who gets to make requests to your service",
           | but it does not mean that.
        
         | jgrall wrote:
         | "You do not have a right to decide what people do with the
         | content." Smh. Yes, laws be damned.
        
       | pera wrote:
       | Like many other generative AI companies, Perplexity exploits the
       | good faith of the old Internet by extracting the content created
       | almost entirely by normal folks (i.e. those who depend on a wage
       | for subsistence) and reproducing it for a profit while removing
       | the creators from the loop - even when normal folks are
       | explicitly asking them to not do this.
       | 
       | If you don't understand why this is at least slightly
       | controversial I imagine you are not a normal folk.
        
       | rapatel0 wrote:
       | This is brilliant marketing and strategy from Cloudflare. They
       | are pointing out bad actors and selling a service where they can
       | be the private security guards for your website.
       | 
       | I think there could be something interesting if they made a
       | caching pub-sub model for data scraping. In addition or in place
       | of trying to be security guards.
        
       | czk wrote:
       | the year is 2045.
       | 
       | you've been cruising the interstate in your robotaxi, shelling
       | out $150 in stablecoins at the cloudflare tollbooth. a palantir
       | patrol unit pulls you over. the optimus v4 approaches your window
       | and contorts its silicone face into a facsimile of concern as it
       | hits you with the:
       | 
       | "sir, have you been botting today?"
       | 
       | immediately you remember how great you had it in the '20s when
       | you used to click CAPTCHA grids to prove your humanity to dumb
       | algorithms, but now the machines demand you recite poetry or weep
       | on command
       | 
       | "how much have you had to bot today?", its voice taking on an
       | empathetic tone that was personalized for your particular profile
       | 
       | "yeah... im gonna need you to exit the vehicle and take a field
       | humanity test"
        
         | benreesman wrote:
         | greentext courtesy of opus prompted in greentext by me.
         | 
         | ``` The user wants me to create a more elaborate greentext with
         | deeper worldbuilding and references to internet culture/incel
         | memes, but specifically asks for "tasteful" ones and explicitly
         | says no violent references. I should incorporate classic
         | internet/chan culture references like tendies, NEET lifestyle,
         | "tfw no gf", basement dwelling, etc. while building out this
         | dystopian 2045 world where proving humanity is the central
         | challenge.
         | 
         | \>be me, 2045 \>32yo NEET still living in mom's smart-basement
         | \>she threatens to turn off my Neuralink if I don't go outside
         | today \>grudgingly take the robotaxi to pick up tendies \>$150
         | stablecoin toll at cloudflare checkpoint \>have to solve a
         | moral dilemma to prove I'm not running on GPT-7 \>sweating.dll
         | \>palantir unit pulls me over \>optimus v4 leans into window
         | \>its facial mesh attempts "concern_expression_v2.blend"
         | \>"sir, when did you last feel genuine human connection?"
         | \>flashback to 2024 when the girl at McDonalds gave me extra
         | honey mustard \>that was before the McBots took over
         | \>"t-twenty one years ago officer" \>optimus's empathy
         | subroutines activate \>"sir I need you to perform a field
         | humanity test" \>get out, knees weak from vitamin D deficiency
         | \>"please describe your ideal romantic partner without using
         | the words 'tradwife' or 'submissive'" \>brain.exe has stopped
         | responding \>try to remember pre-blackpill emotions \>"someone
         | who... likes anime?" \>optimus scans my biometrics \>"stress
         | patterns indicate authentic social anxiety, carry on citizen"
         | \>get back in robotaxi \>it starts therapy session \>"I notice
         | you ordered tendies again. Let's explore your relationship with
         | your mother" \>tfw the car has better emotional intelligence
         | than me \>finally get tendies from Wendy's AutoServ \>receipt
         | prints with mandatory "rate your humanity score today" \>3.2/10
         | \>at least I'm improving
         | 
         | \>mfw bots are better at being human than humans \>it's over
         | for carboncels ```
        
       | decide1000 wrote:
       | C'mon CF. What are you doing? You are literally breaking the
       | internet with your policing behaviour. It's starting to look
       | like the Great Firewall.
        
         | jgrall wrote:
         | Not affiliated with CF in any way. Respectfully disagree.
         | Calling out bad actors is in the public interest.
        
           | imcritic wrote:
           | CF is a bad actor. They ruin internet. They own more and more
           | parts of it.
        
       | kylestanfield wrote:
       | Perplexity claims that you can "use the following robots.txt tags
       | to manage how their sites and content interact with Perplexity."
       | https://docs.perplexity.ai/guides/bots
       | 
       | Their fetcher (not crawler) has the user agent Perplexity-User.
       | Since the fetching is user-requested, it ignores robots.txt. The
       | article discusses how blocking the "Perplexity-User" user agent
       | doesn't actually work, and how Perplexity uses an anonymous user
       | agent to avoid being blocked.
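Per the Perplexity bot documentation linked above, the declared agents can in principle be excluded with ordinary robots.txt rules like the following; the article's point is that an undeclared crawler never consults this file at all. (PerplexityBot and Perplexity-User are the agent names those docs list; the blanket Disallow is just an example policy.)

```
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```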
        
       | nostrademons wrote:
       | It's entirely possible that it's not _Perplexity_ using the
       | stealth undeclared crawlers, but rather their fallback is to
       | contract out to a dedicated for-pay webscraping firm that
       | retrieves the desired content through unspecified means. (Some of
       | these are pretty dodgy - several scraping companies effectively
       | just install malware on consumer machines and then use their
       | botnet to grab data for their customers.) There was a story on
       | HN not long ago about the FBI using similar means to perform
       | surveillance that would be illegal if the FBI did it itself, but
       | becomes legal once they split the different parts up across a
       | supply chain:
       | 
       | https://news.ycombinator.com/item?id=44220860
        
       | echo42null wrote:
       | Hmm, I've always seen robots.txt more as a polite request than an
       | actual rule.
       | 
       | Sure, Google has to follow it because they're a big company and
       | need to respect certain laws or internal policies. But for
       | everyone else, it's basically just a "please don't" sign, not a
       | legal requirement, right?
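Compliance really is client-side and voluntary: a well-behaved fetcher parses robots.txt itself and then decides whether to proceed. A minimal sketch with Python's stdlib parser (the agent name and paths are made up for illustration):

```python
from urllib import robotparser

# Rules a site might publish; normally these are fetched from /robots.txt.
rules = [
    "User-agent: ExampleBot",   # hypothetical crawler name
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The parser only answers a question -- nothing technically stops a
# client from fetching the disallowed URL anyway.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
```

The "polite request" framing is accurate: the Disallow line has no enforcement mechanism, which is exactly why sites fall back to IP- and fingerprint-based blocking.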
        
       | crossroadsguy wrote:
       | I was recently listening to Cloudflare CEO on the Hard Fork
       | podcast. He seemed to be selling a way for content creators to
       | stop AI companies from profiting off such leeching. But the way
       | he laid the whole thing out, adding how they are best placed to
       | do this because they are gatekeepers of X% of the Internet (I
       | don't recall the exact percentage), had me more concerned than I
       | was at the prospect of AI companies being the front of summarised
       | or interpreted consumption.
       | 
       | He went on, upfront -- I'll give him that -- to explain how he is
       | expecting a certain percentage of that income that will come from
       | enforcing this on those AI companies and when the AI companies
       | pay up to crawl.
       | 
       | Cloudflare already questions my humanity and then every once in a
       | while blocks me with zero recourse. Now they are literally
       | proposing more control and gatekeeping.
       | 
       | Where have we all ended up on the Internet? Are we openly going
       | back to the wild west of bounty hunters and Pinkertons (in a
       | way)?
        
       | skeledrew wrote:
       | This is why Perplexity is my preferred deep search engine. The
       | no-crawl directives don't really make sense when I'm doing
       | research and want my tool of choice to be able to pull from any
       | relevant source. If a site doesn't want particular users to
       | access their content, put it behind a login. The only way I - and
       | eventually many others - will see it in the first place anyway is
       | when it pops up as a cited source in the LLM output, and there's
       | an actual need to go to said source.
        
         | remus wrote:
         | > The no-crawl directives don't really make sense when I'm
         | doing research and want my tool of choice to be able to pull
         | from any relevant source.
         | 
         | If you are the source I think they could make plenty of sense.
         | As an example, I run a website where I've spent a lot of time
         | documenting the history of a somewhat niche activity. Much of
         | this information isn't available online anywhere else.
         | 
         | As it happens I'm happy to let bots crawl the site, but I think
         | it's a reasonable stance to not want other companies to profit
         | from my hard work. Even more so when it actually costs me money
         | to serve requests to the company!
        
           | crazygringo wrote:
           | > _but I think it's a reasonable stance to not want other
           | companies to profit from my hard work_
           | 
           | Imagine someone at another company reads your site, and it
           | informs a strategic decision they make at the company to make
           | money around the niche activity you're talking about. And
           | they make lots of money they wouldn't have otherwise. That's
           | totally legal and totally ethical as well.
           | 
           | The reality is, if you do hard work and make the results
           | public, well you've made them public. People and corporations
           | are free to profit off the facts you've made public, and they
           | should be. There are certain limited copyright protections
           | (they can't sell large swathes of your words verbatim), but
           | that's all.
           | 
           | So the idea that you don't want companies to profit from your
           | hard work _is_ unreasonable, if you make it public. If you
           | don't want that to happen, don't make anything public.
        
             | remus wrote:
             | For me, the point is that the person who has put in the
             | work then has some rights to decide how that information is
             | accessed and re-used. I think it is a reasonable position
             | for someone to hold that they want individuals to be able
             | to freely use some content they produced, but not for a
             | company to use and profit from that same content. I think
             | just saying "It's public now" lacks any nuance.
             | 
             | Ultimately these AI tools are useful because they have
             | access to huge swaths of content, and the owners of these
             | tools turn a lot of revenue by selling access to these
             | tools. Ultimately I think the internet will end up a much
             | worse place if companies don't respect clearly established
             | wishes of people creating the content, because if companies
             | stop respecting things like robots.txt then people will
             | just hide stuff behind logins, paywalls and frustrating
             | tools like cloudflare which use heuristics to block
             | malicious traffic.
        
               | crazygringo wrote:
               | > _the person who has put in the work then has some
               | rights to decide how that information is accessed and
               | re-used_
               | 
               | You do, but you give up those rights when you make the
               | work public.
               | 
               | You think an author has any control over who their book
               | gets lent to once somebody buys a copy? You think they
               | get a share of profits when a CEO reads their book and
               | they make a better decision? Of course not.
               | 
               | What you're asking for is unreasonable. It's not
               | workable. Knowledge can't be owned. Once you put it out
               | there, it's out there. We have copyright and patent
               | protections in specific circumstances, but that's all.
               | You don't own facts, no matter how much hard work and
               | research they took to figure out.
        
             | ch_fr wrote:
             | On a more human level, I think it's bleak that someone who
             | makes a blog just to share stuff for fun is going to have
             | most of his traffic be scrapers that distill, distort, and
             | reheat whatever he's writing before serving it to potential
             | readers.
        
           | alexey-salmin wrote:
           | > As it happens I'm happy to let bots crawl the site, but I
           | think it's a reasonable stance to not want other companies to
           | profit from my hard work.
           | 
           | How do you square these two? Of course big companies profit
           | from your work, this is why they send all these bots to crawl
           | your site.
        
             | remus wrote:
             | When I said "I think it's a reasonable stance" I meant as
             | in "I think it's a reasonable stance for someone to take,
             | though I don't personally hold that view".
        
       | dhanushreddy29 wrote:
       | PS: perplexity is using cloudflare browser rendering to scrape
       | websites
        
       | kazinator wrote:
       | Why single out Perplexity? Pretty much no crawler out there
       | fetches robots.txt.
       | 
       | robots.txt is not a blocking mechanism; it's a hint to indicate
       | which parts of a site might be of interest to indexing.
       | 
       | People started using robots.txt to lie and declare things like no
       | part of their site is interesting, and so of course that gets
       | ignored.
        
         | gcbirzan wrote:
         | That's not true, at all.
        
       | kotaKat wrote:
       | An AI service violating peoples' consent? Say it isn't so! Those
       | damn assault-culture techbros at it again.
        
       | bob1029 wrote:
       | Has anyone bothered to properly quantify the worst case load
       | (i.e., requests per second) that has been incurred by these
       | scraping tools? I recall a post on HN a few weeks/months ago
       | about something similar, but it seemed very light on figures.
       | 
       | It seems to me that ~50% of the discourse occurring around AI
       | providers involves the idea that a machine reading webpages on a
       | regular schedule is tantamount to a DDOS attack. The other half
       | seems to be regarding IP and capitalism concerns - which seem
       | like far more viable arguments.
       | 
       | If someone requesting your site map once per day is crippling
       | operations, the simplest solution is to make the service not run
       | like shit. There is a point where your web server becomes so fast
       | you stop caring about locking everyone into a draconian content
       | prison. If you can serve an average page in 200µs and your
       | competition takes 200ms to do it, you have roughly 1000x the
       | capacity to mitigate an aggressive scraper (or actual DDOS
       | attack) in terms of CPU time.
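The back-of-envelope arithmetic behind that claim, using the figures quoted above (illustrative numbers, not measurements):

```python
# CPU time per request at the two service speeds from the comment above.
fast = 200e-6   # 200 microseconds per page
slow = 200e-3   # 200 milliseconds per page

# The fast server absorbs roughly 1000x the request rate per core.
print(round(slow / fast))   # 1000

# Requests per second a single core can serve at each speed.
print(round(1 / fast))      # 5000
print(round(1 / slow))      # 5
```

At 5 requests/s per core, a modest scraper looks like a DDoS; at 5000, it is background noise.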
        
         | ch_fr wrote:
         | I mean, it did happen, don't you remember in March when
         | SourceHut had outages because their most expensive endpoints
         | were being spammed by scrapers?
         | 
         | Don't you remember the reason Anubis even came to be?
         | 
         | It really wasn't that long ago, so I find all of the snarky
         | comments going "erm, actually, I've yet to see any good actors
         | get harmed by scraping ever, we're just reclaiming power from
         | today's modern ad-ridden hellscape" pretty dishonest.
        
       | xmodem wrote:
       | Question for those in this thread who are okay with this: If I
       | have endpoints that are computationally expensive server-side,
       | what mechanism do you propose I could use to avoid being
       | overwhelmed?
       | 
       | The web will be a much worse place if such services are all
       | forced behind captchas or logins.
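Short of captchas or logins, the usual answer is per-client rate limiting in front of the expensive endpoints. A minimal token-bucket sketch (the class name and limits are illustrative, not any particular product's API):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client (e.g. keyed by IP); when allow() returns False,
# the server answers 429 instead of doing the expensive work.
bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]
print(results)  # burst of 5 allowed, then denied until tokens refill
```

This doesn't distinguish humans from bots, but it caps the damage any single client can do without walling off the open web.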
        
         | m3047 wrote:
         | In 2005 I used a bot motel with Markov Chain derived dummy
         | content for this exact purpose.
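The "bot motel" idea -- endless pages of Markov-chain filler that looks like prose to a scraper -- is easy to sketch. (The corpus and chain order here are arbitrary toy choices, not the commenter's actual setup.)

```python
import random

def build_chain(text, order=1):
    """Map each word (tuple) to the words observed to follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def babble(chain, length=20, seed=0):
    """Random-walk the chain to emit plausible-looking dummy text."""
    rng = random.Random(seed)
    key = rng.choice(list(chain))
    out = list(key)
    for _ in range(length):
        nxt = chain.get(tuple(out[-len(key):]))
        if not nxt:
            break
        out.append(rng.choice(nxt))
    return " ".join(out)

corpus = ("the crawler fetched the page and the server "
          "served the crawler more pages")
print(babble(build_chain(corpus)))
```

Serving this at trivial cost while real content sits elsewhere inverts the economics: the scraper pays to ingest noise.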
        
         | alexey-salmin wrote:
         | How do you make the money you need to finance these
         | computationally expensive server-side endpoints?
        
           | xmodem wrote:
           | Maybe I'm a community-driven project funded by donations and
           | volunteer time. Maybe I'm a local government with extremely
           | limited IT budget and no in-house skills. Maybe I'm just some
           | dude who maintains a hobby project that lives on a NUC under
           | my desk.
        
       | codecracker3001 wrote:
       | > we were able to fingerprint this crawler using a combination of
       | machine learning and network signals.
       | 
       | what machine learning algorithms are they using? time to deploy
       | them onto our websites
        
       | madrox wrote:
       | Every time there's an industry disruption there's good money to
       | be made in providing services to incumbents that slow the
       | transition down. You saw it in streaming, and even the internet
       | at large. Cloudflare just happens to be the business filling that
       | role this time.
       | 
       | I don't really mind because history shows this is a temporary
       | thing, but I hope web site maintainers have a plan B beyond
       | hoping Cloudflare will protect them from AI forever. Whoever has
       | an onramp for people who run websites today to make money from
       | will make a lot of money.
        
       | nialse wrote:
       | Is it just me or is it rage bait? Switching up marketing a notch
       | when the AI paywall did not get much media attention so far?
       | Cloudflare seems to focus on enterprise marketing nowadays,
       | currently geared towards the media industry, rather than the
       | technical marketing suited for the HN audience. They have no
       | horse in the AI race, so they're betting on the anti-AI horse
       | instead to gain market share in the media sector?
        
       | zeld4 wrote:
       | The Internet was built on trust, but not anymore. It's a
       | Darwinian system; everyone has to find their own way to survive.
       | 
       | Cloudflare will help their publishers block more aggressively,
       | and AI companies will up their game too. Harvesting information
       | online is hard labor that needs to be paid for, either to AI or
       | to humans.
        
       | hnburnsy wrote:
       | Response from Perplexity to TechCrunch...
       | 
       | >Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's blog
       | post as a "sales pitch," adding in an email to TechCrunch that
       | the screenshots in the post "show that no content was accessed."
       | In a follow-up email, Dwyer claimed the bot named in the
       | Cloudflare blog "isn't even ours."
        
       | throwmeaway222 wrote:
       | Change "no-crawl" to "will-sue"
       | 
       | and see if that fixes the problem.
        
       | fsckboy wrote:
       | the internet needs micropayments (probably millipayments). if
       | crawlers want to pay me a penny a page, crawl me 24-7 plz
       | 
       | if I am willing to pay a penny a page, i and the people like me
       | won't have to put up with clickwrap nonsense
       | 
       | free access doesn't have to be shut off (ok, it will be, but it
       | doesn't have to be, and doesn't that tell you something?)
       | 
       | reddit could charge stiffer fees, but refund quality content to
       | encourage better content. i've fantasized about ideas like "you
       | pay upfront a deposit; you get banned, you lose your deposit;
       | withdraw, have your deposit back", the goal being simplify the
       | moderation task while encouraging quality.
       | 
       | because where the internet is headed is just more and more trash.
       | 
       | here's another idea, pay a penny per search at google/search
       | engine of choice. if you don't like the results, you can take the
       | penny back. google's ai can figure out how to please you. if the
       | pennies don't keep coming in, they serve you ad-infested results;
       | serve up ad-infested results, you can send your penny to a
       | different search engine.
        
       | zzo38computer wrote:
       | I do not want to block curl and lynx. But if they claim to be
       | Chrome then I don't care if Chrome is blocked
        
       | qwerty456127 wrote:
       | It's time to stop blocking crawlers and using captchas and start
       | building web sites that are intentionally AI-friendly by design.
       | Even before the modern LLMs, anti-scraper measures apparently
       | were primarily befitting Google whose scrapers were the most
       | common exception.
        
       | hrpnk wrote:
       | Previously it was all sniper and sneaker bots scanning websites
       | for product availability and attempting purchases continuously to
       | snipe products when they come back online.
       | 
       | Now, it's a gazillion of AI crawlers and python crawlers, MCP
       | servers that offer the same feature to anyone "building (personal
       | workflow) automation" incl. bypass of various, standard
       | protection mechanisms.
        
       ___________________________________________________________________
       (page generated 2025-08-04 23:00 UTC)