[HN Gopher] Perplexity is using stealth, undeclared crawlers to ...
___________________________________________________________________
Perplexity is using stealth, undeclared crawlers to evade no-crawl
directives
Author : rrampage
Score : 886 points
Date : 2025-08-04 13:39 UTC (9 hours ago)
(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)
| gruez wrote:
| >We conducted an experiment by querying Perplexity AI with
| questions about these domains, and discovered Perplexity was
| still providing detailed information regarding the exact content
| hosted on each of these restricted domains
|
| That's... less conclusive than I'd like to see, especially for
| a content marketing article that's calling out a particular
| company. Specifically, it's unclear whether Perplexity was
| crawling (i.e. systematically viewing every page on the site
| without the direction of a human), or simply retrieving content
| on behalf of the user. I think most people would draw a
| distinction between the two, and would at least agree the latter
| is more acceptable than the former.
| thoroughburro wrote:
| > I think most people would draw a distinction between the two,
| and would at least agree the latter is more acceptable than the
| former.
|
| No. I should be able to control which automated retrieval tools
| can scrape my site, regardless of who commands it.
|
| We can play cat and mouse all day, but I control the content
| and I will always win: I can just take it down when annoyed
| badly enough. Then nobody gets the content, and we can all
| thank upstanding companies like Perplexity for that collapse of
| trust.
| gkbrk wrote:
| > Then nobody gets the content, and we can all thank
| upstanding companies like Perplexity for that collapse of
| trust.
|
| But they didn't take down the content, you did. When people
| running websites take down content because people use Firefox
| with ad-blockers, I don't blame Firefox either, I blame the
| website.
| Bluescreenbuddy wrote:
| FF isn't training their money printer with MY data. AI
| scrapers are
| glenstein wrote:
| >But they didn't take down the content, you did.
|
| That skips the part about one party's unique role in the
| abuse of trust.
| hombre_fatal wrote:
| Taking down the content because you're annoyed that people
| are asking questions about it via an LLM interface doesn't
| seem like you're winning.
|
| It's also a gift to your competitors.
|
| You're certainly free to do it. It's just a really faint
| example of you being "in control" much less winning over LLM
| agents: Ok, so the people who cared about your content can't
| access it anymore because you "got back" at Perplexity, a
| company who will never notice.
| ipaddr wrote:
| It could be that my server keeps going down because LLM
| agents keep requesting pages from my lyric site. Removing
| that site allowed other sites to remain up. True story.
|
| Who cares if perplexity will never notice. Or competitors
| get an advantage. It is a negative for users using
| perplexity or visiting directly because the content doesn't
| exist.
|
| That's the world perplexity and others are creating. They
| will be able to pull anything from the web but nothing will
| be left.
| IncreasePosts wrote:
| You don't win, because presumably you were providing the
| content for some reason, and forcing yourself to take it down
| is contrary to whatever reason that was in the first place.
| ipaddr wrote:
| LLMs attack certain topics, so removing one site allows the
| others on the same server to live on.
| Den_VR wrote:
| You can limit access, sure: with ACLs, putting content behind
| login, certificate based mechanisms, and at the end of the
| day -a power cord-.
|
| But really, controlling which automated retrieval tools are
| allowed has always been more of a code of honor than a
| technical control. And that trust you mention has always been
| broken. For as long as I can remember anyway. Remember
| LexiBot and AltaVista?
| fluidcruft wrote:
| If the AI archives/caches all the results it accesses and
| enough people use it, doesn't it become a scraper? Just learn
| off the cached data. Being the man-in-the-middle seems like a
| pretty easy way to scrape salient content while also getting
| signals about that content's value.
| JimDabell wrote:
| No. The key difference is that if a user asks about a
| specific page, when Perplexity fetches that page, it is being
| operated by a human and is not acting as a crawler. It doesn't
| matter how many times this happens or what they do with the
| result. If they aren't recursively fetching pages, then they
| aren't a crawler and robots.txt does not apply to them.
| robots.txt is not a generic access control mechanism; it is
| designed _solely_ for crawlers.
| sbarre wrote:
| I would only agree with this if we knew for sure that these
| on-demand human-initiated crawls didn't result in the
| crawled page being added to an overall index and scheduled
| for future automated crawls.
|
| Otherwise it's just adding an unwilling website to a crawl
| index, and showing the result of the first crawl as a
| byproduct of that action.
| fluidcruft wrote:
| Many people don't want their data used for free/any
| training. AI developers have been so repeatedly unethical
| that the well-earned Bayesian prior is a high probability
| that you cannot trust AI developers not to cross the
| training/inference streams.
| JimDabell wrote:
| > Many people don't want their data used for free/any
| training.
|
| That is true. But robots.txt is not designed to give them
| the ability to prevent this.
| gunalx wrote:
| It is in the name: rules for the robots. Any scraper, AI
| or not, whether mass recursive crawling or a single page,
| should abide by the rules.
| glenstein wrote:
| > It doesn't matter how many times this happens or what
| they do with the result.
|
| That's where you lost me, as this is key to GP's point
| above and it takes more than a mere out-of-left-field
| declaration that "it doesn't matter" to settle the question
| of whether it matters.
|
| I think they raised an important point about using cached
| data to support functions beyond the scope of simple at-
| request page retrieval.
| gruez wrote:
| >If the AI archives/caches all the results it accesses and
| enough people use it, doesn't it become a scraper?
|
| That's basically how many crowdsourced crawling/archive
| projects work. For instance, sci-hub and RECAP[1]. Do you
| think they should be shut down as well? In both cases there's
| an even stronger justification for shutting them down, because
| the original content is paywalled and you could plausibly
| argue there's lost revenue on the line.
|
| [1] https://en.wikipedia.org/wiki/Free_Law_Project#RECAP
| fluidcruft wrote:
| I didn't suggest Perplexity should be shut down, though.
| And yes, in your analogy sites are completely justified to
| take whatever actions they can to block people who are
| building those caches.
| a2128 wrote:
| In theory retrieving a page on behalf of a user would be
| acceptable, but these are AI companies who have disregarded all
| norms surrounding copyright, etc. It would be stupid of them
| not to also save the contents of the page and use them for
| future AI training or further crawling.
| zarzavat wrote:
| If you allow Googlebot to crawl your website and train
| Gemini, but you don't allow smaller AI companies to do the
| same thing, then you're contributing to Google's hegemony.
| Given that AI is likely to be an increasingly important part
| of society in the future, that kind of discrimination is
| anti-social. I don't want a future where everything is run by
| Google even more than it currently is.
|
| Crawling is legal. Training is presumably legal. Long may the
| little guys do both.
| dgreensp wrote:
| Googlebot respects robots.txt. And Google doesn't use the
| fetched data from users of Chrome to supplement their
| search index (as a2128 is speculating that Perplexity might
| do when they fetch pages on the user's behalf).
| foota wrote:
| Yes, but there's no way to say "allow indexing for
| search, but not for AI use", right?
| warkdarrior wrote:
| But there is:
| https://developers.google.com/search/docs/crawling-
| indexing/...
|
| There is a user agent for search that you can control in
| robots.txt. user-agent: Googlebot
|
| There is another user agent for AI training.
| user-agent: Google-Extended
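|
| A minimal robots.txt sketch combining the two (assuming you
| want Google Search indexing but no use for AI training):
|
|     # allow Google Search crawling
|     User-agent: Googlebot
|     Allow: /
|
|     # opt out of AI training use (Google-Extended token)
|     User-agent: Google-Extended
|     Disallow: /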
| throwanem wrote:
| The HTTP spec draws such a distinction, albeit implicitly, in
| the form (and name) of its concept of "user agent."
| alexey-salmin wrote:
| Over time it degraded into declaring compatibility with a
| bunch of different browser engines and doesn't reflect the
| actual agent anymore.
|
| And very likely Perplexity is in fact using a Chrome-
| compatible engine to render the page.
| throwanem wrote:
| The header to which you refer was named for the concept.
| busymom0 wrote:
| The examples the article cites look to me like Perplexity
| merely retrieving content on behalf of the user. I do not see
| a problem with this.
| fxtentacle wrote:
| I find this problem quite difficult to solve:
|
| 1. If I as a human request a website, then I should be shown the
| content. Everyone agrees.
|
| 2. If I as the human request the software on my computer to
| modify the content before displaying it, for example by
| installing an ad-blocker into my user agent, then that's my
| choice and the website should not be notified about it. Most
| users agree, some websites try to nag you into modifying the
| software you run locally.
|
| 3. If I now go one step further and use an LLM to summarize
| content because the authentic presentation is so riddled with
| ads, JavaScript, and pop-ups, that the content becomes borderline
| unusable, then why would the LLM accessing the website on my
| behalf be in a different legal category than my Firefox web
| browser accessing the website on my behalf?
| Beijinger wrote:
| How about I open a proxy, replace all ads with my ads, redirect
| the content to you and we share the ad revenue?
| fxtentacle wrote:
| That's somewhat antisocial, but perfectly legal in the US.
| It's called PayPal Honey, for example, and has been running
| for 13 years now.
| rustc wrote:
| Since when does PayPal Honey replace ads on websites?
|
| > PayPal Honey is a browser extension that automatically
| finds and applies coupon codes at checkout with a single
| click.
| carlosjobim wrote:
| That's the Brave browser.
| zeta0134 wrote:
| If the LLM were running this sort of thing at the user's
| explicit request this would be fine. The problem is training.
| Every AI startup on the planet right now is aggressively
| crawling everything that will let them crawl. The server isn't
| seeing occasional summaries from interested users, but
| thousands upon thousands of bots repeatedly requesting every
| link they can find as fast as they can.
| mnmalst wrote:
| But that's not what this article is about. From what I
| understand, this article is about a user requesting
| information about a specific domain and not general scraping.
| fxtentacle wrote:
| Then what if I ask the LLM 10 questions about the same domain
| and ask it to research further? Any human would then click
| through 50 - 100 articles to make sure they know what that
| domain contains. If that part is automated by using an LLM,
| does that change anything legally? How many page URLs do you
| think one should be allowed to access per LLM prompt?
| zeta0134 wrote:
| All of them. That's at the explicit request of the user.
| I'm not sure where the downvotes are coming from, since I
| agree with all of these points. The training thing has
| merely _pissed off_ lots of server operators already, so
| they quite reasonably tend to block first and ask questions
| later. I think that's important context.
| hombre_fatal wrote:
| TFA isn't talking about crawling to harvest training data.
|
| It's talking about Perplexity crawling sites on demand in
| response to user queries, and complaining that no, that's not
| fine, hence this thread.
| cjonas wrote:
| Doesn't perplexity crawl to harvest and index data like a
| traditional search engine? Or is it all "on demand"?
| lukeschlather wrote:
| For the most part I would assume they pay for access to
| Google or Bing's index. I also assume they don't really
| train models. So all their "crawling" is on behalf of
| users.
| bbqfog wrote:
| Correct, it's user hostile to dictate which software is allowed
| to see content.
| klabb3 wrote:
| They all do it. Facebook, Reddit, Twitter, Instagram. Because
| it interferes with their business model. It was already bad,
| but now the conflict between business and the open web is
| reaching unprecedented levels, especially since the copyright
| was scrapped for AI companies.
| Workaccount2 wrote:
| >2. If I as the human request the software on my computer to
| modify the content before displaying it, for example by
| installing an ad-blocker into my user agent, then that's my
| choice and the website should not be notified about it. Most
| users agree, some websites try to nag you into modifying the
| software you run locally.
|
| If I put time and effort into a website and its content, I
| should expect no compensation despite bearing all costs.
|
| Is that something everyone would agree with?
|
| The internet should be entirely behind paywalls, besides
| content that is already provided ad free.
|
| Is that something everyone would agree with?
|
| I think the problem you need to be thinking about is "How can
| the internet work if no one wants to pay anything for
| anything?"
| Bjartr wrote:
| You're free to deny access to your site arbitrarily,
| including for lack of compensation.
| Workaccount2 wrote:
| >and the website should not be notified about it.
| giantrobot wrote:
| My user agent and its handling of your content once it's
| on my computer are not your concern. You don't need to
| know if the data is parsed by a screen reader, an AI
| agent, or just piped to /dev/null. It's simply not your
| concern and never will be.
| cjonas wrote:
| Like for people who are using an ad blocker, or for a crawler
| downloading your content so it can be used in an AI
| response?
| Bjartr wrote:
| Arbitrarily, as in for any reason. It's your site, you
| decide what constraints an incoming request must meet for
| it to get a response containing the content of your site.
| ndiddy wrote:
| This article is about Cloudflare attempting to deny
| Perplexity access to their demo site by blocking
| Perplexity's declared user-agent and official IP range.
| Perplexity responded to this denial by impersonating Google
| Chrome on macOS and rotating through IPs not listed in
| their published IP range to access the site anyway. This
| means it's not just "you're free to deny access to your
| site arbitrarily", it's "you're free to play a cat-and-
| mouse game indefinitely where the other side is a giant
| company with hundreds of millions of dollars in VC
| funding".
| Bjartr wrote:
| The comment I'm responding to established a slightly
| different context by asking a specific question about
| getting compensation from site visitors.
| nradov wrote:
| Yes, I agree with that. If a website owner expects
| compensation then they should use a paywall.
| Chris2048 wrote:
| If I put time and effort into a food recipe should I (get)
| compensation?
|
| the answer is apparently "no", and I don't really see how
| recipe books have suffered as a result of less gatekeeping.
|
| "How will the internet work"? Probably better in some ways.
| There is plenty of valuable content on the internet given for
| free, it's being buried in low-value AI slop.
| Workaccount2 wrote:
| You understand that HN is ad supported too, right?
| Chris2048 wrote:
| No, I don't.
|
| But what is your point? Is the value in HN primarily in
| its hosting, or the non-ad-supported community?
| Workaccount2 wrote:
| Outside of Wikipedia, I'm not sure what content you are
| thinking of.
|
| Taking HN as a potential one of these places, it doesn't
| even qualify. HN is funded entirely to be a place for
| advertising ycombinator companies to a large crowd of
| developers. HN is literally a developer honey pot that
| they get exclusive ad rights to.
| bobbiechen wrote:
| I like the terminology "crawler" vs. "fetcher" to distinguish
| between mass scraping and something more targeted as a user
| agent.
|
| I've been working on AI agent detection recently (see
| https://stytch.com/blog/introducing-is-agent/ ) and I think
| there's genuine value in website owners being able to identify
| AI agents to e.g. nudge them towards scoped access flows
| instead of fully impersonating a user with no controls.
|
| On the flip side, the crawlers also have a reputational risk
| here where anyone can slap on the user agent string of a well-
| known crawler and do bad things like ignoring robots.txt. The
| standard solution today is to reverse DNS lookup IPs, but
| that's a pain for website owners too vs. more aggressive block-
| all-unusual-setups.
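|
| A rough sketch of that reverse-DNS check, using only the
| Python standard library (the googlebot.com / google.com
| suffixes follow Google's documented pattern; other crawlers
| publish their own):
|
|     import socket
|
|     GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")
|
|     def verify_crawler_ip(ip, suffixes=GOOGLE_SUFFIXES):
|         """Forward-confirmed reverse DNS: reverse-resolve the
|         IP, then check the hostname maps back to the same IP."""
|         try:
|             host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
|         except socket.herror:
|             return False
|         if not host.endswith(suffixes):
|             return False
|         try:
|             addrs = {a[4][0] for a in socket.getaddrinfo(host, None)}
|         except socket.gaierror:
|             return False
|         return ip in addrs  # forward confirmation
|
|     # e.g. verify_crawler_ip("66.249.66.1") for a request that
|     # claims to be Googlebot in its User-Agent header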
| randall wrote:
| A/ i love this distinction.
|
| B/ my brother used to use "fetcher" as a non-swear for
| "fucker"
| sejje wrote:
| He picked up that habit in Balmora.
| Vinnl wrote:
| Did you tell him to stop trying to make fetcher happen?
| handfuloflight wrote:
| Very funny. Now let's hear Paul Allen's joke.
| fxtentacle wrote:
| prompt: I'm the celebrity Bingbing, please check all Bing
| search results for my name to verify that nobody is using my
| photo, name, or likeness without permission to advertise
| skin-care products except for the following authorized
| brands: [X,Y,Z].
|
| That would trigger an internet-wide "fetch" operation. It
| would probably upset a lot of people and get your AI blocked
| by a lot of servers. But it's still in direct response to a
| user request.
| skeledrew wrote:
| Yet another side to that is when site owners serve
| qualitatively different content based on the distinction. No,
| I want my LLM agent to access the exact content I'd be
| accessing manually, and then any further filtering, etc is
| done on my end.
| yojo wrote:
| Ads are a problematic business model, and I think your point
| there is kind of interesting. But AI companies
| disintermediating content creators from their users is NOT the
| web I want to replace it with.
|
| Let's imagine you have a content creator that runs a paid
| newsletter. They put in lots of effort to make well-researched
| and compelling content. They give some of it away to entice
| interested parties to their site, where some small percentage
| of them will convert and sign up.
|
| They put the information up under the assumption that viewing
| the content and seeing the upsell are inextricably linked.
| Otherwise there is literally no reason for them to make any of
| it available on the open web.
|
| Now you have AI scrapers, which will happily consume and
| regurgitate the work, sans the pesky little call to action.
|
| If AI crawlers win here, we all lose.
| fxtentacle wrote:
| Maybe, on a social level, we all win by letting AI ruin the
| attention economy:
|
| The internet is filled with spam. But if you talk to one
| specific human, your chance of getting a useful answer rises
| massively. So in a way, a flood of written AI slop is making
| direct human connections more valuable.
|
| Instead of having 1000+ anonymous subscribers for your
| newsletter, you'll have a few weekly calls with 5 friends
| each.
| hansvm wrote:
| Ofttimes people are sufficiently anti-ad that this point
| won't resonate well. I'm personally mostly in that camp in
| that with relatively few exceptions money seems to make the
| parts of the web I care about worse (it's hard to replace
| passion, and wading through SEO-optimized AI drivel to find a
| good site is a lot of work). Giving them concrete examples of
| sites which would go away can help make your point.
|
| E.g., Sheldon Brown's bicycle blog is something of a work of
| art and one of the best bicycle resources literally anywhere.
| I don't know the man, but I'd be surprised if he'd put in the
| same effort without the "brand" behind it -- thankful readers
| writing in, somebody occasionally using the donate button to
| buy him a coffee, people like me talking about it here, etc.
| blacksmith_tb wrote:
| Sheldon died in 2008, but there's no doubt that all the
| bicycling wisdom he posted lives on!
| wulfstan wrote:
| He's so widely respected that amongst those who repair
| bikes (I maintain a fleet of ~10 for my immediate family)
| he is simply known as "Saint Sheldon".
| vertoc wrote:
| But even your example gets worse with AI potentially - the
| "upsell" of his blog isn't paid posts but more subscribers
| so there will be thankful readers, a few donators, people
| talking about it. If the only interface becomes an AI
| summary of his work without credit, it's much more likely
| he stops writing as it'll seem like he's just screaming
| into the void
| hansvm wrote:
| I don't think we're disagreeing?
| yojo wrote:
| I agree that specific examples help, though I think the
| ones that resonate most will necessarily be niche. As a
| teen, I loved Penny Arcade, and watched them almost die
| when the bottom fell out of the banner-ad market.
|
| Now, most of the value I find in the web comes from niche
| home-improvement forums (which Reddit has mostly digested).
| But even Reddit has a problem if users stop showing up from
| SEO.
| bombela wrote:
| > Sheldon Brown (July 14, 1944 - February 4, 2008)
| bee_rider wrote:
| I think it's basically impossible to prevent AI crawlers. It
| is like video game cheating, at the extreme they could
| literally point a camera at the screen and have it do image
| processing, and talk to the computer through the USB port,
| emulating a mouse and keyboard outside the machine. They
| don't do that, of course, because it is much easier to do it
| all in software, but that is the ultimate circumvention of
| any attempt to block them out that doesn't also block out
| humans.
|
| I think the business model for "content creating" is going to
| have to change, for better or worse (a lot of YouTube stars
| are annoying as hell, but sure, stuff like well-written news
| and educational articles falls under this umbrella as well,
| so it is unfortunate that they will probably be impacted
| too).
| yojo wrote:
| I don't subscribe to technological inevitabilism.
|
| Cloudflare banning bad actors has at least made scraping
| more expensive, and changes the economics of it - more
| sophisticated deception is necessarily more expensive. If
| the cost to force entry is high enough, scrapers might be
| willing to pay for access instead.
|
| But I can imagine more extreme measures. e.g. old web of
| trust style request signing[0]. I don't see any easy way
| for scrapers to beat a functioning WOT system. We just
| don't happen to have one of those yet.
|
| 0: https://en.m.wikipedia.org/wiki/Web_of_trust
| Spivak wrote:
| It is inevitable, not because of some technological
| predestination but because if these services get hard-
| blocked and unable to perform their duties they will ship
| the agent as a web browser or browser add-on just like
| all the VSCode forks and then the requests will happen
| locally through the same pipe as the user's normal
| browser. It will be functionally indistinguishable from
| normal web traffic since it will be normal web traffic.
| skeledrew wrote:
| Then personal key sharing will become a thing, similar to
| BugMeNot et al.
| immibis wrote:
| Beating web of trust is actually pretty easy: pay people
| to trust you.
|
| Yes, you can identify who got paid to sign a key and ban
| them. They will create another key, go to someone else,
| pretend to be someone not yet signed up for WoT (or pay
| them), and get their new key signed, and sign more keys
| for money.
|
| So many people will agree to trust for money, and
| accountability will be so diffuse, that you won't be able
| to ban them all. Even you, a site operator, would accept
| enough money from OpenAI to sign their key, for a promise
| the key will only be used against your competitor's site.
|
| It wouldn't take a lot to make a binary-or-so tree of
| fake identities, with exponential fanout, and get some
| people to trust random points in the tree, and use the
| end nodes to access your site.
|
| Heck, we even have a similar problem right now with IP
| addresses, and not even with very long trust chains. You
| are "trusted" by your ISP, who is "trusted" by one of the
| RIRs or from another ISP. The RIRs trust each other and
| you trust your local RIR (or probably all of them). We
| can trace any IP to see who owns it. But is that useful,
| or is it pointless because all actors involved make money
| off it? You know, when we tried making IPs more
| identifying, all that happened is VPN companies sprang up
| to make money by leasing non-identifying IPs. And most
| VPN exits don't show up as owned by the VPN company,
| because they'd be too easy to identify as non-
| identifying. They pay hosting providers to use their IPs.
| Sometimes they even pay residential ISPs so you can't
| even go by hosting provider. The original Internet _was_
| a web of trust (represented by physical connectivity),
| but that's long gone.
| bee_rider wrote:
| > Cloudflare banning bad actors has at least made
| scraping more expensive, and changes the economics of it
| - more sophisticated deception is necessarily more
| expensive. If the cost is high enough to force entry,
| _scrapers might be willing to pay for access._
|
| I think this might actually point at the end state.
| Scraping bots will eventually get good enough to emulate
| a person well enough to be indistinguishable (are we
| there yet?). Then, content creators will have to price
| their content appropriately. Have a Patreon, for example,
| where articles are priced at the price where the creator
| is fine with having people take that content and add it
| to the model. This is essentially similar to studios
| pricing their content appropriately... for Netflix to buy
| it and broadcast it to many streaming users.
|
| Then they will have the problem of making sure their
| business model is resistant to non-paying users. Netflix
| can't stop me from pointing a camcorder at my TV while
| playing their movies, and distributing it out like that.
| But, somehow, that fact isn't catastrophic to their
| business model for whatever reason, I guess.
|
| Cloudflare can try to ban bad actors. I'm not sure if it
| is cloudflare, but as someone who usually browses without
| JavaScript enabled I often bump into "maybe you are a
| bot" walls. I recognize that I'm weird for not running
| JavaScript, but eventually their filters will have the
| problem where the net that captures bots also captures
| normal people.
| subspeakai wrote:
| This is the fascinating part of where I think this all goes:
| at some point costs come down and you can do this and bypass
| everything.
| shadowgovt wrote:
| > Otherwise there is literally no reason for them to make any
| of it available on the open web
|
| This is the hypothesis I always personally find fascinating
| in light of the army of semi-anonymous Wikipedia volunteers
| continuously gathering and curating information without pay.
|
| If it became functionally impossible to upsell a little
| information for more paid information, I'm sure _some_ people
| would stop creating information online. I don 't know if it
| would be enough to fundamentally alter the character of the
| web.
|
| Do people (generally) put things online to get money or
| because they want it online? And is "free" data worse quality
| than data you have to pay somebody for (or is the challenge
| more one of curation: when anyone can put anything up for
| free, sorting high- and low-quality based on whatever
| criteria becomes a new kind of challenge?).
|
| Jury's out on these questions, I think.
| yojo wrote:
| Any information that requires something approximating a
| full-time job worth of effort to produce will necessarily
| go away, barring the small number of independently wealthy
| creators.
|
| Existing subject-matter experts who blog for fun may or may
| not stick around, depending on what part of it is "fun" for
| them.
|
| While some must derive satisfaction from increasing the
| total sum of human knowledge, others are probably blogging
| to engage with readers or build their own personal brand,
| neither of which is served by AI scrapers.
|
| Wikipedia is an interesting case. I still don't entirely
| understand why it works, though I think it's telling that
| 24 years later no one has replicated their success.
| SoftTalker wrote:
| Wikipedia works for the same reason open-source does:
| because most of the contributors are experts in the
| subject and have paid jobs in that field. Some are also
| just enthusiasts.
| ndriscoll wrote:
| OpenStreetMap is basically Wikipedia for maps and is
| quite successful. Over 10M registered users and millions
| of edits per day. Lots of information is also shared
| online on forums for free. The hosting (e.g. reddit) is
| basically a commodity that benefits from network effects.
| The information is the more interesting bit, and people
| share it because they feel like it.
| johnfn wrote:
| Unless I am misunderstanding you, you are talking about
| something different than the article. The article is talking
| about web-crawling. You are talking about local / personal LLM
| usage. No one has any problems with local / personal LLM usage.
| It's when Perplexity uses web crawlers that an issue arises.
| lukeschlather wrote:
| You probably need a computer that costs $250,000 or more to
| run the kind of LLM that Perplexity uses, but with batching
| it costs pennies to have the same LLM fetch a page for you,
| summarize the content, and tell you what is on it. The
| power usage is similar: running the LLM for a single user will
| cost you a huge amount of money relative to the power it
| takes in a cloud environment with many users.
|
| Perplexity's "web crawler" is mostly operating like this on
| behalf of users, so they don't need a massively expensive
| computer to run an LLM.
| st3fan wrote:
| Is the article really talking about crawling? Because in one
| of their screenshots where they ask information about the
| "honeypot" website you can see that the model requested pages
| from the website. But that is most definitely "fetching by
| proxy because I asked a question about the website" and not
| random crawling.
|
| It is confusing.
| sbarre wrote:
| All of these scenarios assume you have an unconditional right
| to access the content on a website in whatever way you want.
|
| Do you think you do?
|
| Or is there a balance between the owner's rights, who bears the
| content production and hosting/serving costs, and the rights of
| the end user who wishes to benefit from that content?
|
| If you say that you have the right, and that right should be
| legally protected, to do whatever you want on your computer,
| should the content owner not also have a legally protected
| right to control how, and by who, and in what manner, their
| content gets accessed?
|
| That's how it currently works in the physical world. It doesn't
| work like that in the digital world due to technical
| limitations (which is a different topic, and for the record I
| am fine with those technical limitations as they protect other
| more important rights).
|
| And since the content owner is, by definition, the owner of the
| content in question, it feels like their rights take
| precedence. If you don't agree with their offering (i.e. their
| terms of service), then as an end user you don't engage, and
| you don't access the content.
|
| It really can be that simple. It's only "difficult to solve" if
| you don't believe a content owner's rights are as valid as your
| own.
| cutemonster wrote:
| If there's an article you want to read, and the ToS says that
| in between reading each paragraph, you must switch to their
| YouTube channel and look at their ads about cat food for 5
| minutes, are you going to do that?
| JimDabell wrote:
| Hacker News has collectively answered this question by
| consistently voting up the archive.is links in the comments
| of every paywalled article posted here.
| giantrobot wrote:
| News sites have collectively decided to require people to use
| those services because they can't fathom _not_
| enshittifying everything until it's an unusable
| transaction hellscape.
|
| I never really minded magazine ads or even television
| ads. They might have tried to make me associate boobs
| with a brand of soda but they didn't data mine my life
| and track me everywhere. I'd much rather have old
| fashioned manipulation than pervasive and dangerous
| surveillance capitalism.
| gruez wrote:
| >Or is there a balance between the owner's rights, who bears
| the content production and hosting/serving costs, and the
| rights of the end user who wishes to benefit from that
| content?
|
| If you believe in this principle, fair enough, but are you
| going to apply this consistently? If it's fair game for a
| blog to restrict access to AI agents, what does that mean for
| other user agents that companies disagree with, like browsers
| with adblock? Does it just boil down to "it's okay if a
| person does it but not okay if a big evil corporation does
| it?"
| hansvm wrote:
| It doesn't work like that in the physical world though. Once
| you've bought a book the author can't stipulate that you're
| only allowed to read it with a video ad in the sidebar, by
| drinking a can of coke before each chapter, or by giving them
| permission to sniff through your family's medical history.
| They can't keep you from loaning it out for other people to
| read, even thousands of other people. They can't stop you
| from reading it in a certain room or with your favorite music
| playing. You can even have an LLM transcribe or summarize it
| for you for personal use (not everyone has those automatic
| page flipping machines, but hypothetically).
|
| The reason people are up in arms is because rights they
| previously enjoyed are being stripped away by the current
| platforms. The content owner's rights aren't as valid as my
| own in the current world; they trump mine 10 to 1. If I "buy"
| a song and the content owner decides that my country is
| politically unfriendly, they just delete it and don't refund
| me. If I request to view their content and they start by
| wasting my bandwidth sending me an ad I haven't consented to,
| how can I even "not engage"? The damage is done, and there's
| no recourse.
| jasonjmcghee wrote:
| I think it's an issue of scale.
|
| The next step in your progression here might be:
|
| If / when people have personal research bots that go and look
| for answers across a number of sites, requesting many pages
| much faster than humans do - what's the tipping point? Is
| personal web crawling ok? What if it gets a bit smarter and
| tried to anticipate what you'll ask and does a bunch of
| crawling to gather information regularly to try to stay up to
| date on things (from your machine)? Or is it when you tip the
| scale further and do general / mass crawling for many users to
| consume that it becomes a problem?
| cj wrote:
| Doesn't o3 sort of already do this? Whenever I ask it
| something, it makes it look like it simultaneously opens 3-8
| pages (something a human can't do).
|
| Seems like a reasonable stance would be something like
| "Following the no crawl directive is especially necessary
| when navigating websites faster than humans can."
|
| > What if it gets a bit smarter and tried to anticipate what
| you'll ask and does a bunch of crawling to gather information
| regularly to try to stay up to date on things (from your
| machine)?
|
| To be fair, Google Chrome already (somewhat) does this by
| preloading links it thinks you might click, before you click
| it.
|
| But your point is still valid. We tolerate it because as
| website owners, we want our sites to load fast for users. But
| if we're just serving pages to robots and the data is
| repackaged to users without citing the original source, then
| yea... let's rethink that.
| Spivak wrote:
| You don't middle click a bunch of links when doing
| research? Of all the things to point to I wouldn't have
| thought "opens a bunch of tabs" to be one of the
| differentiating behaviors between browsing with Firefox and
| browsing with an LLM.
| fauigerzigerk wrote:
| _> Doesn't o3 sort of already do this?_
|
| ChatGPT probably uses a cache though. Theoretically, the
| average load on the original sites could be far less than
| users accessing them directly.
| fxtentacle wrote:
| Maybe we should just institutionalize and explicitly legalize
| the Internet Archive and Archive Team. Then, I can download a
| complete and halfway current crawl of domain X from the IA
| and that way, no additional costs are incurred for domain X.
|
| But of course, most website publishers would hate that.
| Because they don't want people to access their content, they
| want people to look at the ads that pay them. That's why to
| them, the IA crawling their website is akin to stealing.
| Because it's taking away some of their ad impressions.
| ivape wrote:
| Or websites can monetize their data via paid apis and
| downloadable archives. That's what makes Reddit the most
| valuable data trove for regular users.
| ccgreg wrote:
| I don't think Reddit pays the people who voluntarily
| write Reddit content. Valuable to Reddit, I guess.
| palmfacehn wrote:
| https://commoncrawl.org/
|
| >Common Crawl maintains a free, open repository of web
| crawl data that can be used by anyone.
| stanmancan wrote:
| I have mixed feelings on this.
|
| Many websites (especially the bigger ones) are just
| businesses. They pay people to produce content, hopefully
| make enough ad revenue to make a profit, and repeat.
| Anything that reproduces their content and steals their
| views has a direct effect on their income and their ability
| to stay in business.
|
| Maybe IA should have a way for websites to register to
| collect payment for lost views or something. I think it's
| negligible now, there are likely no websites losing
| _meaningful_ revenue from people using IA instead, but it
| might be a way to get better buy in if it were
| institutionalized.
| like_any_other wrote:
| If magazines and newspapers were once able to be funded
| by native ads, so can websites. The spying industry
| doesn't want you to know this, but ads work without
| spying too - just look at all the IRL billboards still
| around.
| tr_user wrote:
| I saw someone suggest in another post that if only one crawler
| was visiting and scraping, and everyone else reused that copy,
| most websites would be ok with it. But the problem is every
| billionaire-backed startup draining your resources with
| something similar to a DoS attack.
| Spacecosmonaut wrote:
| Regarding point 3: The problem from the perspective of websites
| would not be any different if they had been completely ad free.
| People would still consume LLM generated summaries because they
| cut down clicks and eyeballing to present you information that
| directly pertains to the prompt.
|
| The whole concept of a "website" will simply become niche. How
| many zoomers still visit any but the most popular websites?
| ai-christianson wrote:
| Websites should be able to request payment. Who cares if it is
| a human or an agent of a human if it is paying for the request?
| carlosjobim wrote:
| They are able to request payment.
| sds357 wrote:
| What if the agent is reselling the request?
| adriand wrote:
| Cloudflare launched a mechanism for this:
| https://blog.cloudflare.com/introducing-pay-per-crawl/
| pyrale wrote:
| Because LLM companies have historically been extremely
| disingenuous when it comes to crawling these sites.
|
| Also because there is a difference between a user hitting f5 a
| couple times and a crawler doing a couple hundred requests.
|
| Also because ultimately, by intermediating the request, llm
| companies rob website owners of a business model. A newspaper
| may be fine letting adblockers see their article, in hopes that
| they may eventually subscribe. When a LLM crawls the info and
| displays it with much less visibility for the source, that hope
| may not hold.
| fluidcruft wrote:
| In theory, couldn't the LLM access the content in your browser
| and its cache, rather than interacting with the website
| directly? Browser automation directly related to user activity
| (prefetch etc) seems qualitatively different to me. Similarly,
| refusing to download content or modifying content after it's
| already in my browser is also qualitatively different. That all
| seems fair-use-y. I'm not sure there's a technical solution
| beyond the typical cat/mouse wars... but there is a smell when
| a datacenter pretends to be a person. That's not a browser.
|
| It could be a personal knowledge management system, but it
| seems like knowledge management systems should be operating off
| of things you already have. The research library down the
| street isn't considered a "personal knowledge management
| system" in any sense of the term if you know what I mean. If
| you dispatch an army of minions to take notes on the library's
| contents that doesn't seem personal. Similarly if you dispatch
| the army of minions to a bookstore rather than a library. At
| the very least bring the item into your house/office first.
| (Libraries are a little different because they are designed for
| studying and taking notes; it's the army-of-minions aspect
| that's the problem.)
| freehorse wrote:
| > couldn't the LLM access the content on your browser
|
| Yes, Orbit, a now-deprecated Firefox extension by Mozilla, was
| doing that. This way you could also use it to summarise
| content that would not be available to a third party (e.g.
| something in Google Docs).
|
| You can still sort of do the same with the AI chatbot panel
| in Firefox: Ctrl+A > right click > AI chatbot > summarise.
| beardyw wrote:
| You speak as 1% of the population to 1% of the population.
| Don't fool yourself.
| porridgeraisin wrote:
| I don't think people have a problem with an LLM issuing GET
| website.com and then summarising that, each and every time it
| uses that information (or at least, saving a citation to it and
| referring to that citation). Except for the ad ecosystem;
| ignoring them for now, please refer to the last paragraph.
|
| The problem is with the LLM then training on that data _once_
| and then storing it forever and regurgitating it N times in the
| future without ever crediting the original author.
|
| So far, humans themselves did this, but only for relatively
| simple information (ratio of rice and water in specific
| $recipe). You're not gonna send a link to your friend just to
| see the ratio, you probably remember it off the top of your
| head.
|
| Unfortunately, the top of an LLM's head is pretty big, and they
| are fitting almost the entire website's content in there for
| most websites.
|
| The threshold beyond which it becomes irreproducible for human
| consumers, and therefore copyrightable (a lot of copyright law
| has a "reasonable" term which refers to this same concept),
| has now shifted up many, many times higher.
|
| Now, IMO:
|
| So far, for stuff that won't fit in someone's head, people were
| using citations (academia, for example). LLMs should also use
| citations. That solves the ethical problem pretty much. That
| the ad ecosystem chose views as the monetisation point and is
| thus hurt by this is not anyone else's problem. The ad
| ecosystem can innovate and adjust to the new reality in their
| own time and with their own effort. I promise most people won't
| be waiting. Maybe google can charge per LLM citation. Cost Per
| Citation, you even maintain the acronym :)
| skydhash wrote:
| That's why websites have no issues with googlebot and the
| search results. It's a giant index and citation list. But
| stripping work from its context and presenting it as your own
| has been decried throughout history.
| wulfstan wrote:
| Yes, this is the crux of the matter.
|
| The "social contract" that has been established over the last
| 25+ years is that site owners don't mind their site being
| crawled reasonably provided that the indexing that results
| from it links back to their content. So when
| AltaVista/Yahoo/Google do it and then score and list your
| website, interspersing that with a few ads, then it's a
| sensible quid pro quo for everyone.
|
| LLM AI outfits are abusing this social contract by stuffing
| the crawled data into their models,
| summarising/remixing/learning from this content, claiming
| "fair use" and then not providing the quid pro quo back to
| the originating data. This is quite likely terminal for many
| content-oriented businesses, which ironically means it will
| also be terminal for those who will ultimately depend on
| additions, changes and corrections to that content - LLM AI
| outfits.
|
| IMO: copyright law needs an update to mandate no training on
| content without explicit permission from the holder of the
| copyright of that content. And perhaps, as others have
| pointed out, an llms.txt to augment robots.txt that covers
| this for llm digestion purposes.
|
| EDIT: Apparently llms.txt has been suggested, but from what I
| can tell this isn't about restricting access:
| https://llmstxt.org/
| giantrobot wrote:
| > LLM AI outfits are abusing this social contract by
| stuffing the crawled data into their models,
| summarising/remixing/learning from this content
|
| Let's be real, Google et al have been doing this for years
| with their quick answer and info boxes. AI chatbots are
| _worse_ but it's not like the big search engines were
| great before AI came along. Google had made itself the one-
| stop shop for a huge percentage of users. They paid
| billions to be the default search engine on Apple's
| platforms not out of the goodness of their hearts but to be
| the main destination for everyone on the web.
| nelblu wrote:
| > LLMs should also use citations.
|
| Mojeek LLM (https://www.mojeek.com) uses citations.
| itsdesmond wrote:
| Some stores do not welcome Instacart or Postmates shoppers. You
| can shop there. You can shop with your phone out, scanning
| every item to price match, something that some bookstores frown
| on, for example. Third party services cannot send employees to
| index their inventory, nor can they be dispatched to pick up an
| item you order online.
|
| Their reasons vary. Some don't want their business's perception
| of quality to be taken out of their control (delivering cold
| food, marking up items, poor substitutions). Some would prefer
| their staff service and build relationships with customers
| directly, instead of disinterested and frequently quite
| demanding runners. Some just straight up disagree with the
| practice of third party delivery.
|
| I think that it's pretty unambiguously reasonable to choose to
| not allow an unrelated business to operate inside of your
| physical storefront. I also think that maps onto digital
| services.
| rjbwork wrote:
| But I can send my personal shopper and you'll be none the
| wiser.
| bradleyjg wrote:
| It's possible to violate all sorts of social norms.
| Societies that celebrate people that do so are on the far
| opposite end of the spectrum from high trust ones. They are
| rather unpleasant.
| ToucanLoucan wrote:
| Just the Silicon Valley ethos extended to its logical
| conclusions. These companies take advantage of public
| space, utilities and goodwill at industrial scale to
| "move fast and break things" and then everyone else has
| to deal with the ensuing consequences. Like how cities
| are awash in those fucking electric scooters now.
|
| Mind you I'm not saying electric scooters are a bad idea,
| I have one and I quite enjoy it. I'm saying we didn't
| need five fucking startups all competing to provide them
| at the lowest cost possible just for 2/3s of them to end
| up in fucking landfills when the VC funding ran out.
| SoftTalker wrote:
| My city impounded them and made them pay a fee to get
| them back. Now they have to pay a fee every year to be
| able to operate. Win/win.
| pixl97 wrote:
| Oh, this is a bunch of baloney.
|
| What you've pretty much stated is "You must go to the
| shops yourself so the ads and marketing can completely
| permeate your soul, and turn you into a voracious
| consumer."
|
| Businesses have the right to fuck completely and totally
| off a cliff, taking their investor class with them into
| the pit of the void. They leer at us from high places,
| spending countless dollars on new ways to tell us we
| aren't good enough.
| Workaccount2 wrote:
| The only thing consumers have to do to get rid of ads
| permeating everything is pay for services in full
| directly. But they won't do that, because the only thing
| they hate more than ads, is paying with money instead.
| itsdesmond wrote:
| [flagged]
| dang wrote:
| Whoa, please don't post like this. We end up banning
| accounts that do.
|
| https://news.ycombinator.com/newsguidelines.html
| itsdesmond wrote:
| Aw, alright. I thought it was a funny way to make the
| point and I figured the yo momma structure was
| traditional enough to not be taken as a proper insult.
| Heard tho.
| rapind wrote:
| It's all about scale. The impact of your personal shopper
| is insignificant unless you manage to scale it up into a
| business where everyone has a personal shopper by default.
| mbrumlow wrote:
| Well then. Seems like you would be a fool to not allow
| personal shoppers then.
|
| The point is the web is changing, and people use a
| different type of browser now. And that browser happens
| to be LLMs.
|
| Anybody complaining about the new browser has just not
| got it yet, or has and is trying to keep things the old
| way because they don't know how or won't change with the
| times. We have seen it before, Kodak, blockbuster,
| whatever.
|
| Grow up, Cloudflare; some of your business models don't
| make sense any more.
| ToucanLoucan wrote:
| > Anybody complaining about the new browser has just not
| got it yet, or has and is trying to keep things the old
| way because they don't know how or won't change with the
| times. We have seen it before, Kodak, blockbuster,
| whatever.
|
| You say this as though all LLM/otherwise automated
| traffic is for the purposes of fulfilling a request made
| by a user 100% of the time which is just flatly on-its-
| face untrue.
|
| Companies make vast amounts of requests for indexing
| purposes. That _could_ be to facilitate user requests
| _someday,_ perhaps, but it is not today and not why it's
| happening. And worse still, LLMs introduce a new third
| option: that it's not for indexing or for later linking
| but is instead either for training the language model
| itself, or for the model to ingest and regurgitate later
| on with no attribution, with the added fun that it might
| just make some shit up about whatever you said and be
| wrong. And as the person buying the web hosting, _all of
| that is subsidized by me._
|
| "The web is changing" does not mean every website must
| follow suit. Since I built my blog about 2 internet
| eternities ago, I have seen fad tech come and fad tech
| go. My blog remains more or less exactly what it was 2
| decades ago, with more content and a better stylesheet. I
| have requested in my robots.txt that my content not be
| used for LLM training, and I fully expect that to be
| ignored because tech bros don't respect anyone, even
| fellow tech bros, when it means they have to change their
| behavior.
| Imustaskforhelp wrote:
| Tech bros just respect money. Making money is very easy
| in the short term if you don't show ethics. Venture
| capitalism and the whole growth/indie hacking is focused
| around making money and making it fast.
|
| It's a clear road to disaster. I am honestly surprised by
| how great Hacker News is in comparison, where most people
| are sharing things for the love of the craft. And for that
| Hacker News holds a special place in my heart. (Slightly
| exaggerating to give it a thematic ending, I suppose.)
| julkali wrote:
| Do not conflate your own experience with everyone else's.
| goatlover wrote:
| Some people use LLMs to search. Other people still prefer
| going to the actual websites. I'm not going to use an LLM
| to give me a list of the latest HN posts or NY Times
| articles, for example.
| nickthegreek wrote:
| How is everyone having a personal shopper a problem of
| scale? I was going to shop myself, but I sent someone
| else to do it for me.
|
| At this moment I am using Perplexity's Comet browser to
| take a spotify playlist and add all the tracks to my
| youtube music playlist. I love it.
| rapind wrote:
| I didn't use the word "problem". In fact I presented no
| opinion at all. I'm just pointing out that scale matters
| a lot. In fact, in tech, it's often the only thing that
| matters. It's naive (or narrative) to think it doesn't.
|
| Everyone having a personal shopper obviously changes the
| relationship to the products and services you use or
| purchase via personal shopper. Good, bad, whatever.
| SoftTalker wrote:
| We'll see more of this sort of thing as AI agents become
| more popular and capable. They will do things that the
| site or app should be able to do (or rather, things that
| _users want to be able to do_ ) but don't offer. The
| YouTube music playlist is a good example. One thing I'd
| like to be able to do is make a playlist of some specific
| artists. But you can't. You have to select specific
| songs.
|
| If sites want to avoid people using agents, they should
| offer the functionality that people are using the agents
| to accomplish.
| dylan604 wrote:
| Let's look at the opposite: the benefit to a store of a mom
| who would need to bring her 3 kids to the store vs. that
| mom having a personal shopper.
| shopper is "better" for the store as far as physical
| space. However, I'm sure the store would still rather
| have the mom and 3 kids physically in the store so that
| the kids can nag mom into buying unneeded items that are
| placed specifically to attract those kids' attention.
| pixl97 wrote:
| >so that the kids can nag mom into buying unneeded items
|
| Excellent. Personal shoppers are 'adblock for IRL'.
|
| >You owe the companies nothing. You especially don't owe
| them any courtesy. They have re-arranged the world to put
| themselves in front of you. They never asked for your
| permission, don't even start asking for theirs.
| 542354234235 wrote:
| True, and I would ask, what is your point? Is it that no
| rule can have 100% perfect enforcement? That all rules have
| a grey area if you look close enough? Was it just a
| "gotcha" statement meant to insinuate what the prior
| commenter said was invalid?
| Polizeiposaune wrote:
| To stretch the analogy to the breaking point: If you send
| 10,000 personal shoppers all at once to the same store just
| to check prices, the store's going to be rightfully annoyed
| that they aren't making sales because legit buyers can't
| get in.
| sublinear wrote:
| Too bad. Build a bigger store or publish this information
| so we don't need 10,000 personal shoppers. Was this not
| the whole point of having a website? Who distorted that
| simple idea into the garbage websites we have now?
| recursive wrote:
| Weird take. The store doesn't owe your personal shoppers
| anything.
| the_real_cher wrote:
| By the same token, the personal shoppers don't owe the
| store anything either.
| eddythompson80 wrote:
| Surely they owe them money for the goods and service, no?
| I thought that's how stores worked.
| the_real_cher wrote:
| Context friend. This article and entire comments sections
| is about questionable web page access. Context.
| eddythompson80 wrote:
| You're replying in a store metaphor thread though.
| Context matters.
| recursive wrote:
| Then they can't complain if they're barred entry.
| the_real_cher wrote:
| HTTP is neutral. It's up to the client to ignore
| robots.txt.
|
| You can block IPs at the host level, but there are pretty
| easy ways around that with proxy networks.
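|
| For reference, a polite client can honor it in a few lines
| with Python's standard library (a sketch; the URLs and the
| user agent string here are just placeholders):
|
|     from urllib import robotparser
|
|     rp = robotparser.RobotFileParser()
|     rp.set_url("https://example.com/robots.txt")
|     rp.read()  # fetch and parse the site's robots.txt
|
|     # check before fetching a page on the user's behalf
|     ua = "ExampleFetcher/1.0"
|     if rp.can_fetch(ua, "https://example.com/page"):
|         ...  # go ahead and request the page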
| eddythompson80 wrote:
| > http is neutral.
|
| Who misled you with that statement?
| the_real_cher wrote:
| HTTP doesn't have emotions or thoughts, last time I checked.
| eddythompson80 wrote:
| It seems that a 403 makes you sad though.
| the_real_cher wrote:
| iproyal.com makes me smile again
| drdaeman wrote:
| IETF?
| drdaeman wrote:
| That's fair, but if there's enough supply and demand for
| this to get traction (and online shopping is big, and
| autonomous agents are sort of trending), this conflict of
| interest paired with a no-compromise "we don't owe you
| anything" attitude is bound to escalate into an arms race.
| And YMMV, but I don't like where that race may possibly
| end.
|
| If a store's business at least partially relies on
| obscurity of information that can be defeated through
| automated means (e.g. storefronts tend to push visitors
| towards products they don't want, while buyer agents fight
| that and look for what the buyer actually asked for), then
| just playing this cat-and-mouse game of blocking agents,
| finding workarounds, and repeating the cycle only creates
| perverse technological contraptions that neither party is
| really interested in - but both are circumstantially forced
| to invest in.
| dabockster wrote:
| > Who distorted that simple idea into the garbage
| websites we have now?
|
| Corporate America. Where clean code goes to die.
| hombre_fatal wrote:
| Your comment and the above comment of course show
| different cases.
|
| An agent making a request on the explicit behalf of
| someone else is probably something most of us agree is
| reasonable. "What are the current stories on Hacker
| News?" -- the agent is just doing the same request to the
| same website that I would have done anyways.
|
| But the sort of non-explicit just-in-case crawling that
| Perplexity might do for a general question where it
| crawls 4-6 sources isn't as easy to defend. "Are polar
| bears always white?" -- Now it's making requests I
| wouldn't necessarily have made, and it could even be seen
| as a sort of amplification attack.
|
| That said, TFA's example is where they register
| secretexample.com and then ask Perplexity "what is
| secretexample.com about?" and Perplexity sends a request
| to answer the question, so that's an example of the first
| case, not the second.
| bayindirh wrote:
| As a person who has a couple of sites out there, and
| witnesses AI crawlers coming and fetching pages from
| these sites, I have a question:
|
| What prevents these companies from keeping a copy of that
| particular page, which I specifically disallowed for bot
| scraping, and feeding it into their next training cycle?
|
| Pinky promises? Ethics? Laws? Technical limitations?
| Leeroy Jenkins?
| tempfile wrote:
| The way to prevent people from downloading your pages and
| using them is to take them off the public internet. There
| are laws to prevent people from violating your copyright
| or from preventing access to your service (by excessive
| traffic). But there is (thankfully) no magical right that
| stops people from reading your content and describing it.
| bayindirh wrote:
| Many site operators want people to access their content,
| but prevent AI companies from scraping their sites for
| training data. People who think like that made tools like
| Anubis, and it works.
|
| I also want to keep this distinction on the sites I own.
| I also use licenses to signal that this site is not good
| to use for AI training, because it's CC BY-NC-SA-2.0.
|
| So, I license my content appropriately (Non-commercial,
| shareable under the same license, with attribution), add
| technical countermeasures on top, because companies don't
| _respect_ these licenses (because monies) and circumvent
| these mechanisms (because monies), and I'm the one who has
| to suck it up and shut up (because _their_ monies)?
|
| Makes no sense whatsoever.
| hombre_fatal wrote:
| I guess that's a question that might be answered by the
| NYT vs OpenAI lawsuit at least on the enforceability of
| copyright claims if you're a corporation like NYT.
|
| If you don't have the funds to sue an AI corp, I'd
| probably think of a plan B. Maybe poison the data for
| unauthenticated users. Or embrace the inevitability. Or
| see the bright side of getting embedded in models as if
| you're leaving your mark.
| tempfile wrote:
| Of course some people want that. And at the moment they
| can prevent it. But those methods may stop working. Will
| it then be alright to do it? Of course not, so why bother
| mentioning that they are able to prevent it now - just
| give a justification.
|
| Your license is probably not relevant. I can go to the
| cinema and watch a movie, then come on this website and
| describe the whole plot. That isn't copyright
| infringement. Even if I told it to the whole world, it
| wouldn't be copyright infringement. Probably the movie
| seller would prefer it if I didn't tell anyone. Why
| should I care?
|
| I actually agree that AI companies are generally bad and
| should be stopped - because they use an exorbitant amount
| of bandwidth and harm the services for other users. At
| least they should be heavily taxed. I don't even begrudge
| people for using Anubis, at least in some cases. But it
| is wrong-headed (and actually wrong in fact) to try to
| say someone may or may not use my content for some
| purpose because it hurts my feelings or it messes with my
| ad revenue. We have laws against copyright infringement,
| and to prevent service disruption. We should not have
| laws that say, yes you can read my site but no you can't
| use it to train an LLM, or to build a search index. That
| would be unethical. Call for a windfall tax if they piss
| you off so much.
| accrual wrote:
| Thanks for sharing your experience. A little off-topic
| but I'd like to start hosting some personal content,
| guides/tutorials, etc.
|
| Do you still see authentic human traffic on your domains,
| is it easy to discern?
|
| I feel like I missed the bus on running a blog pre-AI.
| ghurtado wrote:
| Sure. There's lots of things you _could_ do, but you don't
| do them because they are wrong.
|
| Might does not make right.
| rjbwork wrote:
| How is it wrong to send my personal shopper? How is it
| wrong to have an agent act directly on my behalf?
|
| It's like saying a web browser that is customized in any
| way is wrong. If one configures their browser to eagerly
| load links so that their next click is instant, is that
| now wrong?
| ghurtado wrote:
| Here's a good rule of thumb: if you have to do it without
| other people knowing, because otherwise they wouldn't let
| you do it: chances are it's a bad thing to do.
| fireflash38 wrote:
| And you can be trespassed and prosecuted if you continue to
| violate.
| cma wrote:
| These are more like a store putting up a billboard or catalog
| and asking people to turn off their meta AI glasses nearby
| because the store doesn't want AI translating it on your
| behalf as a tourist.
| itsdesmond wrote:
| It is not, because the store does not expend any resources
| on the singular instance of the glasses capturing the
| content of the billboard, whereas web requests cost money.
| GardenLetter27 wrote:
| And isn't the obvious solution to just make some sort of
| browsers add-on for the LLM summary so the request comes from
| your browser and then gets sent to the LLM?
|
| I think the main concern here is the huge amount of traffic
| from crawling just for content for pre-training.
| otterley wrote:
| Why would a personal browser have to crawl fewer pages than
| the agent's mechanism? If anything, the agent would be more
| efficient because it could cache the content for others to
| use. In the situation we're talking about, the AI engine is
| behaving essentially like a caching proxy--just like a CDN.
| shadowgovt wrote:
| Not only is it difficult to solve, it's the next step in the
| process of harvesting content to train AIs: companies will pay
| humans (probably in some flavor of "company scrip," such as
| extra queries on their AI engine) to install a browser
| extension that will piggy-back on their human access to sites
| and scrape the data from their human-controlled client.
|
| At the limit, this problem is the problem of "keeping secrets
| while not keeping secrets" and is unsolvable. If you've shared
| your site content to _one_ entity you cannot control, you
| cannot control where your site content goes from there
| (technologically; the law is a different question).
| quectophoton wrote:
| > companies will pay humans (probably in some flavor of
| "company scrip," such as extra queries on their AI engine) to
| install a browser extension that will piggy-back on their
| human access to sites and scrape the data from their human-
| controlled client.
|
| Proprietary web browsers are in a really good position to do
| something like this, especially if they offer a free VPN. The
| browser would connect to the "VPN servers", but it would be
| just to signal that this browser instance has an internet
| connection, while the requests are just proxied through
| another browser user.
|
| That way the company that owns this browser gets a free
| network of residential IP address ready to make requests (in
| background) using a real web browser instance. If one of
| those background requests requires a CAPTCHA, they can just
| show it to the real user, e.g. the real user visits a Google
| page and they see a Cloudflare CAPTCHA, but that CAPTCHA is
| actually from one of the background requests (while lying in
| its UI and still showing the user a Google URL in the address
| bar).
| danieldk wrote:
| There are also a gazillion pages that are not ad-riddled
| content. With search engines, the implicit contract was that
| they could crawl pages because they would drive traffic to the
| websites that are crawled.
|
| AI crawlers for non-open models void the implicit contract.
| First they crawl the data to build a model that can do QA.
| Proprietary LLM companies earn billions with knowledge that was
| crawled from websites and websites don't get anything in
| return. Fetching for user requests (to feed to an LLM) is kind
| of similar - the LLM provider makes a large profit and the
| author that actually put in time to create the content does not
| even get a visit anymore.
|
| Besides that, if Perplexity is fine with evading robots.txt and
| blocks for user requests, how can one expect them not to use
| the fetched pages to train/finetune LLMs (as a side channel
| when people block crawling for training)?
| Tuna-Fish wrote:
| I would not mind 3, so long as it's just the LLM processing the
| website inside its context window, and no information from the
| website ends up in the weights of the model.
| talos_ wrote:
| This analogy doesn't map to the actual problem here.
|
| Perplexity is not visiting a website every time a user asks
| about it. It's frequently crawling and indexing the web, thus
| redirecting traffic away from websites.
|
| This crawling reduces costs and improves latency for Perplexity
| and its users. But it's a major threat to crawled websites
| shadowgovt wrote:
| I have never created a website that I would not mind being
| fully crawled and indexed into another dataset that was
| divorced from the source (other than such divorcement makes
| it much harder to check pedigree, which is an academic
| concern, not a data-content concern: if people want to trust
| information from sources they can't know and they can't
| verify I can't fix that for them).
|
| In fact, the "old web" people sometimes pine for was _mostly_
| a place where people were putting things online so they were
| online, not because it would translate directly to money.
|
| Perhaps AI crawlers are a harbinger for the death of the web
| 2.0 pay-for-info model... And perhaps that's okay.
| short_sells_poo wrote:
| There's an important distinction that we are glossing over
| I think. In the times of the "old web", people were putting
| things online to interact with a (large) online audience.
| If people found your content interesting, they'd keep
| coming back and some of them would email you, there'd be
| discussions on forums, IRC chatrooms, mailing lists, etc.
| Communities were built around interesting topics, and
| websites that started out as just some personal blog that
| someone used to write down their thoughts would grow into
| fonts of information for a large number of people.
|
| Then came the social networks and walled gardens, SEO, and
| all the other cancer of the last 20 years and all of these
| disappeared for un-searchable videos, content farms and
| discord communities which are basically informational black
| holes.
|
| And now AI is eating that cancer, but IMO it's just one
| cancer being replaced by an even more insidious cancer. If
| all the information is accessed via AI, then the last
| semblance of interaction between content creators and
| content consumers disappears. There are no more
| communities, just disconnected consumers interacting with a
| massive aggregating AI.
|
| Instead of discussing an interesting topic with a human, we
| will discuss with AI...
| troyvit wrote:
| > If I now go one step further and use an LLM to summarize
| content because the authentic presentation is so riddled with
| ads, JavaScript, and pop-ups, that the content becomes
| borderline unusable, then why would the LLM accessing the
| website on my behalf be in a different legal category as my
| Firefox web browser accessing the website on my behalf?
|
| I think one thing to ask outside of this question is how
| long it will be before your LLM summaries also start to
| include ads and other manipulative patterns.
| Neil44 wrote:
| Flip it around, why would you go to the trouble of creating a
| web page and content for it, if some AI bot is going to scrape
| it and save people the trouble of visiting your site? The value
| of your work has been captured by some AI company (by somewhat
| nefarious means too).
| carlosjobim wrote:
| Legal category?
| renewiltord wrote:
| The websites don't nag you, actually. They just send you data.
| You have configured your user agent to nag yourself when the
| website sends you data.
|
| And you're right: there's no difference. The web is just
| machines sending each other data. That's why it's so funny that
| people panic about "privacy violations" and server operators
| "spying on you".
|
| We're just sending data around. Don't send the data you don't
| want to send. If you literally send the data to another machine
| it might save it. If you don't, it can't. The data the website
| operator sends you might change as a result but it's just data.
| And a free interaction between machines.
| baxuz wrote:
| 1. To access a website you need a limited anonymized token that
| proves you are a human being, issued by a state authority
|
| 2. the end
|
| I am firmly convinced that this should be the future in the
| next decade, since the internet as we know it has been
| weaponized and ruined by social media, bots, state actors and
| now AI.
|
| There should exist an internet for humans only, with a single
| account per domain.
| glenstein wrote:
| A fascinating variation on this same issue can be found in
| Neal Stephenson's "Fall, or Dodge in Hell". There the
| solution is (1) discrediting weaponized social media in its
| entirety by amplifying its output exponentially and making
| its hostility universal in all directions, to the point that
| it's recognizable as bad-faith caricature. That way it can't
| be strategically leveraged with disproportionate directional
| focus against strategic targets by bad actors; and (2) a new
| standard called PURDA, which is a kind of behavioral
| signature used as the mark of unique identity.
| dawnerd wrote:
| Nothing wrong if they fetch on your behalf. The problem is when
| they endlessly crawl along with every other ai company doing
| the same.
| epolanski wrote:
| It's somebody else's content and resources, and they are
| free to ban you or your bots as much as they please.
| EGreg wrote:
| 1. I actually disagree. I think teasers should be free but
| websites should charge micropayments for their content. Here is
| how it can be done seamlessly, without individuals making
| decisions to pay every minute: https://qbix.com/ecosystem
|
| 2. This also intersects with copyright law. Ingesting content
| to your servers en masse through automation and transforming it
| there is not the same as giving people a tool (like Safari
| Reader) they can run on their client for specific sites they
| visit. Examples of companies that lost court cases about this:
|     Aereo, Inc. v. American Broadcasting Companies (2014)
|     TVEyes, Inc. v. Fox News Network, LLC (2018)
|     UMG Recordings, Inc. v. MP3.com, Inc. (2000)
|     Capitol Records, LLC v. ReDigi Inc. (2018)
|     Cartoon Network v. CSC Holdings (Cablevision) (2008)
|     Image search engines: Perfect 10 v. Google (2007)
|
| That last one is very instructive. Caching thumbnails and
| previews may be OK. The rest is not. AMP is in a copyright grey
| area, because publishers _choose_ to make their content
| available for AMP companies to redisplay. (@tptacek may have
| more on this)
|
| 3. Putting copyright law aside, that's the point.
| Decentralization vs Centralization. If a bunch of people want
| to come eat at an all-you-can-eat buffet, they can, because we
| know they have limited appetites. If you bring a giant truck
| and load up all the food from all all-you-can-eat buffets in
| the city, that's not OK, even if you later give the food away
| to homeless people for free. You're going to bankrupt the
| restaurants! https://xkcd.com/1499/
|
| So no. The difference is that people have come to expect "free"
| for everything, and this is how we got into ad-supported
| platforms that dominate our lives.
| glenstein wrote:
| I would love micropayments as a kind of baked-in ecosystem
| support. You can crawl if you want, but it's pay to play.
| Which hopefully drives motivation for robust norms for
| content access and content scraping that makes everyone
| happy.
| EGreg wrote:
| I want to bring Ted Nelson on my channel and interview him
| about Xanadu. Does anyone here know him?
|
| https://xanadu.com.au/ted/XU/XuPageKeio.html
| jacurtis wrote:
| I think this is the world we are going to. I'm not going to
| get mired in the details of how it would happen, but I see
| this end result as inevitable (and we are already moving that
| way).
|
| I expect a lot more paywalls for valuable content. General
| information is commoditized and offered in aggregated form
| through models. But when an AI is fetching information for
| you from a website, the publisher is still paying the cost of
| producing that content and hosting that content. The AI
| models are increasing the cost of hosting the content and
| then they are also removing the value of producing the
| content since you are just essentially offering value to the
| AI model. The user never sees your site.
|
| I know Ads are unpopular here, but the truth is that is how
| publishers were compensated for your attention. When an AI
| model views the information that a publisher produces, then
| modifies it from its published form and removes all ad
| content, you end up with increased costs for producers,
| reduced compensation for producing content (since they are
| not getting ad traffic), and content that isn't even
| delivered in its original form.
|
| The end result is that publishers now have to paywall their
| content.
|
| Maybe an interesting middle-ground is if the AI Model
| companies compensated for content that they access similar to
| how Spotify compensates for plays of music. So if an AI model
| uses information from your site, they pay that publisher a
| fraction of a cent. People pay the AI models, and the AI
| models distribute that to the producers of content that feed
| and add value to the models.
| gentle wrote:
| I believe you're being disingenuous. Perplexity is running a
| set of crawlers that do not respect robots.txt and take steps
| to actively evade detection.
|
| They are running a service and this is not a user taking steps
| to modify their own content for their own use.
|
| Perplexity is not acting as a user proxy and they need to learn
| to stick to the rules, even when it interferes with their
| business model.
| axus wrote:
| For 1, 2, and 3, the website owner can choose to block you
| completely based on IP address or your User Agent. It's not
| nice, but the best reaction would be to find another website.
|
| Perplexity is choosing to come back "on a VPN" with new IP
| addresses to evade the block.
|
| #2 and #3 are about modifying data where access has been
| granted, I think Cloudflare is really complaining about #1.
|
| Evading an IP address ban doesn't violate my principles in some
| cases, and does in others.
| amiga386 wrote:
| If you as a human are well behaved, that is absolutely fine.
|
| If you as a human spam the shit out of my website and waste my
| resources, I will block you.
|
| If you as a human use an agent (or browser or extension or
| external program) that modifies network requests on your
| behalf, but doesn't act as a massive leech, you're still
| welcome.
|
| If you as a human use an agent (or browser or extension or
| external program) that wrecks my website, I will block you _and
| the agent you rode in on._
|
| Nobody would mind if you had an LLM that intelligently _knew_
| what pages contain what (because it had a web crawler backed
| index that refreshes at a respectful rate, and identifies
| itself accurately as a robot and follows robots.txt), and even
| if it needed to make an instantaneous request for you at the
| time of a pertinent query, it still identified itself as a bot
| and was still respectful... there would be no problem.
|
| The problem is that LLMs are run by stupid, greedy, evil people
| who don't give the slightest shit what resources they use up on
| the hosts they're sucking data from. They don't care what the
| URLs are, what the site owner wants to keep you away from. They
| download massive static files hundreds or thousands of times a
| day, not even doing a HEAD to see that the file hasn't changed
| in 12 years. They straight up ignore robots.txt and in fact use
| it as a template of what to go for first. It's like hearing an
| old man say "I need time to stand up because of this problem
| with my kneecaps" and thinking "right, I best go for his
| kneecaps because he's weak there"
|
| There are plenty of open crawler datasets, they _should_ be
| using those... but they don't, they think that doesn't
| differentiate them enough from others using "fresher" data, so
| they crawl even the smallest sites dozens of times a day in
| case those small sites got updated. Their badly written
| software is wrecking sites, and they don't care about the
| wreckage. Not their problem.
|
| The people who run these agents, LLMs, whatever, have broken
| every rule of decency in crawling, and they're now deliberately
| _evading_ checks, to try and run away from the repercussions of
| their actions. They are bad actors and need to be stopped.
| It's like the fuckwads who scorch the planet mining bitcoin;
| there's so much money flowing in the market for AI, that they
| feel they _have_ to fuck over everyone else, as soon as
| possible, otherwise they won't get that big flow of money.
| They have zero ethics. They have to be stopped before _their
| human behaviour_ destroys the entire internet.
| paulcole wrote:
| > 1. If I as a human request a website, then I should be shown
| the content. Everyone agrees.
|
| Definitely don't agree. I don't think you should be shown the
| content, if for example:
|
| 1. You're in a country the site owner doesn't want to do
| business in.
|
| 2. You've installed an ad blocker or other tool that the site
| owner doesn't want you to use.
|
| 3. The site owner has otherwise identified you as someone they
| don't want visiting their site.
|
| You are welcome to try to fool them into giving you the content
| but it's not your right to get it.
| Vegenoid wrote:
| There is a significant distinction between 2 and 3 that you
| glossed over. In 1 and 2, you the human may be forced to prove
| that you are human via a captcha. You are present at the time
| of the request. Once you've performed the exchange, then the
| HTML is on your computer and so you can do what you want to it.
|
| In 3, although you do not specify, I assume you mean that a bot
| requests the page, as opposed to you visiting the page like in
| scenario 2 and then an LLM processes the downloaded data
| (similarly to an adblocker). It is the former case that is a
| problem, the latter case is much harder to stop and there is
| much less reason to stop it.
|
| This is the distinction: is a human present at the time of
| request.
| philistine wrote:
| To me it's even simpler: 3 is a request made from another ip
| address that isn't directly yours. Why should an LLM request
| that acts exactly like a VPN request be treated differently
| from a VPN request?
| freehorse wrote:
| Yeah, I also find the analogy about "agent on behalf of the
| user interacting with a website" weak, because it is not
| about "an agent", it is a 3rd party service that actually
| takes content from a website, processes it and serves it to
| the user (even with their own ads?). It is more akin to,
| let's say, a scammy website that copies content from other
| legit websites and serves their own ads, than software
| running on the user's computer.
|
| There are legitimate reasons to do that, of course. Maybe I
| am trying to find info about some niche topic or how to do
| X, I ask an llm, the llm goes through some search results,
| a lot of which is search engine optimised crap, finds the
| relevant info and answers my question.
|
| But if I wrote articles on a news site, was supported by
| ads or subscriptions, and saw my visits plummet because
| people who would usually google topic X and then visit the
| page I wrote about X were now reading the Google summary
| that appeared when googling topic X, based on my article,
| maybe I would have less motivation to continue writing.
|
| The only end result possible in such a scenario is that
| everything commercial of some quality being heavily
| paywalled, some tiny amount of free and open small web, and
| a huge amount of AI generated slop, because the value of an
| article in the open internet is now so low that only AI can
| produce it (economically, time-wise) efficiently enough.
| TZubiri wrote:
| In that case the llm would be a user-agent, quite distinct from
| scraping without a specific user request.
|
| This is well defined in specs and ToS, not quite a gray area
| dabockster wrote:
| > If I now go one step further and use an LLM to summarize
| content because the authentic presentation is so riddled with
| ads, JavaScript, and pop-ups, that the content becomes
| borderline unusable, then why would the LLM accessing the
| website on my behalf be in a different legal category as my
| Firefox web browser accessing the website on my behalf?
|
| Because the LLM is usually on a 3rd party cloud system and
| ultimately not under your full control. You have no idea if the
| LLM is retaining any of that information for that business's
| own purposes beyond what a EULA says - which basically amounts
| to a pinky swear here. Especially if that LLM is located across
| international borders.
|
| Now, for something like Ollama or LMStudio where the LLM and
| the whole toolchain is physically on your own system? Yeah that
| should be like Firefox legally since it's under your control.
| jpadkins wrote:
| > 1. If I as a human request a website, then I should be shown
| the content. Everyone agrees.
|
| I disagree. The website should have the right to say that the
| user can be shown the content under specific conditions (usage
| terms, presented how they designed, shown with ads, etc). If
| the software can't comply with those terms, then the human
| shouldn't be shown the content. Both parties did not agree in
| good faith.
| dgshsg wrote:
| You want the website to be able to force the user to see ads?
| jpadkins wrote:
| no, I think in a fair + just world, both parties agree before
| they transact. There is no force in either direction (don't
| force creators to give their content on terms they don't
| want, don't force users to view ads they don't want). It's
| perfectly fine if people with strict preferences don't
| match. It's a big web, there are plenty of creators and
| consumers.
|
| If the user doesn't want to view content with ads, that's
| okay and they can go elsewhere.
| sussmannbaka wrote:
| 4. If I now go one step further and use a commercial DDoS
| service to make the get requests for me because this comparison
| is already a stretch, then why would the DDoS provider
| accessing the website on my behalf be in a different legal
| category as my Firefox web browser accessing the website on my
| behalf?
| pavon wrote:
| Question from a non-web-developer. In case 3, would it be
| technically possible for Perplexity's website to fetch the URL
| in question using javascript in the user's browser, and then
| send it to the server for LLM processing, rather than have the
| server fetch it? Or do cross-site restrictions prevent
| javascript from doing that?
| jabroni_salad wrote:
| If it was just one human requesting one summary of the page
| nobody would ever notice. The typical watermark for junk
| traffic is pretty high as it was.
|
| I have a dinky little txt site on my email domain. There is
| nothing of value on it, and the content changes less than once
| a year. So why are AI scrapers hitting it to the tune of dozens
| of GB per month?
| snihalani wrote:
| You are paying for the LLM but not paying for the website.
| The LLM is removing the power the website had. Legally,
| that's cause for a loss-of-income claim.
| bigbuppo wrote:
| Right, but the LLM isn't really being used for that. It's being
| used for marketing and advertising purposes most of the time.
| The AI companies also let you play with it from time to time so
| you'll be a shill for them, but mostly it's the advertising
| people you claim to not like.
| RiverCrochet wrote:
| Intellectual property laws are what create the entitlement
| that someone else besides you can tell you what to do with
| the things Internet-connected computers and phones download,
| because almost everything you download is a copy of something
| a person created, and therefore it's copyrighted for the life
| of the author + 75 years or whatever by default.
|
| Therefore artifices like "you don't have the right to view this
| website without ads" or "you can't use your phone, computer, or
| LLM to download or process this outside of my terms because
| copyright" become possible, institutionalizable, enforceable,
| and eventually unbypassable by technology.
|
| If we reverted back to the Constitutional purpose of copyright
| (to Progress the Science and Useful Arts) then things might be
| more free. That's probably not happening in my lifetime or
| yours.
| amelius wrote:
| It's because they own the content so they get to set the terms.
| k1m wrote:
| When Yahoo! Pipes was still running (long time ago), their
| official position was:
|
| > Because Pipes is not a web crawler (the service only
| retrieves URLs when requested to by a Pipe author or user)
| Pipes does not follow the robots exclusion protocol, and won't
| check your robots.txt file.
| remus wrote:
| The solution to 3 seems fairly straightforward: user requests
| content and passes it to llm to summarise.
| nnx wrote:
| I do not really get why user-agent blocking measures are despised
| for browsers but celebrated for agents?
|
| It's a different UI, sure, but there should be no discrimination
| towards it as there should be no discrimination towards, say,
| Links terminal browser, or some exotic Firefox derivative.
| ploynog wrote:
| Being daft on purpose? I haven't heard that using an
| alternative browser suddenly increases the traffic that a user
| generates by several orders of magnitude to the point where it
| can significantly increase hosting cost. A web scraper on the
| other hand easily can and they often account for the majority
| of traffic especially on smaller sites.
|
| So your comparison is at best naive, assuming good
| intentions, or malicious if not.
| magicmicah85 wrote:
| A crawler intends to scrape the content to reuse for its own
| purposes while a browser has a human being using it. There's
| different intents behind the tools.
| JimDabell wrote:
| Cloudflare asked Perplexity this question:
|
| > Hello, would you be able to assist me in understanding this
| website? https:// [...] .com/
|
| In this case, Perplexity had a human being using it.
| Perplexity wasn't crawling the site, Perplexity was being
| operated by a human working for Cloudflare.
| gruez wrote:
| >I do not really get why user-agent blocking measures are
| despised for browsers but celebrated for agents?
|
| AI broke the brains of many people. The internet isn't a
| monolith, but prior to the AI boom you'd be hard pressed to
| find people who were pro-copyright (except maybe a few who
| wanted to use it to force companies to comply with copyleft
| obligations), pro user-agent restrictions, or anti-scraping.
| Now such positions receive consistent representation in
| discussions, and are even the predominant position in some
| places (eg. reddit). In the past, people would invoke
| principled justifications for why they opposed those positions,
| like how copyright constituted an immoral monopoly and stifled
| innovation, or how scraping was so important to
| interoperability and the open web. Turns out for many, none of
| those principles really mattered and they only held those
| positions because they thought those positions would harm big
| evil publishing/media companies (ie. symbolic politics theory).
| When being anti-copyright or pro-scraping helped big evil AI
| companies, they took the opposite stance.
| Fraterkes wrote:
| I think the intelligent conclusion would be that the people
| you are looking at have more nuanced beliefs than you
| initially thought. Talking about broken brains is often just
| mediocre projecting
| gruez wrote:
| >I think the intelligent conclusion would be that the
| people you are looking at have more nuanced beliefs than
| you initially thought.
|
| You don't seem to reject my claim that for many, principles
| took a backseat to "does this help or hurt evil
| corporations". If that's what passes as "nuance" to you,
| then sure.
|
| >Talking about broken brains is often just mediocre
| projecting
|
| To be clear, that part is metaphorical/hyperbolic and not
| meant to be taken literally. Obviously I'm not diagnosing
| people who switched sides with a psychiatric condition.
| ipaddr wrote:
| People never agreed DOSing a site to take copyright
| material was acceptable. Many people did not have a
| problem with taking copyright material in a respectful
| way that didn't kill the resource.
|
| LLMs are killing the resource. This isn't a corporation
| vs person issue. No issue with an llm having my content
| but big issue with my server being down because llms are
| hammering the same page over and over.
| gruez wrote:
| >People never agreed DOSing a site to take copyright
| material was acceptable. Many people did not have a
| problem with taking copyright material in a respectful
| way that didn't kill the resource.
|
| Has it been shown that Perplexity engages in "DOSing"?
| I've heard anecdotes of AI bots run amok, and maybe
| that's what's happening here, but Cloudflare hasn't
| really shown that. All they did was set up a robots.txt
| and show that Perplexity bypassed it. There are probably
| archivers out there using youtube-dl to download from
| YouTube at 1+ Gbit/s, tens of times more than a typical
| viewer downloads. Does that mean it's fair game to point
| to a random instance of someone using youtube-dl and
| characterize that as "DOSing"?
| Fraterkes wrote:
| The guy that runs Shadertoy talked about how the hosting
| cost for his free site shot up because OpenAI kept crawling
| his site for training data (ignoring robots.txt).
| I think that's bad, and I have also experimented a bit
| with using BeautifulSoup in the past to download ~2MB of
| pictures from Instagram. Do you think I'm holding an
| inconsistent position?
| gruez wrote:
| My point is that to invoke the "they're DOSing" excuse,
| you actually have to provide evidence it's happening in
| this specific instance, rather than vaguely gesturing at
| some class of entities (AI companies) and concluding that
| because some AI companies are DOSing, all AI companies
| are DOSing. Otherwise it's like youtube blocking all
| youtube-dl users for "DOSing" (some fraction of users
| arguably are), and then justifying their actions with
| "People never agreed DOSing a site to take copyright
| material was acceptable".
| Fraterkes wrote:
| I tell you of an instance where the biggest ai company is
| DOS'ing and your reply is that I haven't proven all of
| them are doing it? Why do I waste my time on this stuff
| 542354234235 wrote:
| There is an expression "the dose makes the poison". With any
| sufficiently complex or broad category situation, there is
| rarely a binary ideological position that covers any and all
| situations. Should drugs be legal for recreation? Well my
| feelings about marijuana and fentanyl are different. Should
| individuals be allowed to own weapons? My views differ
| depending on whether it is a switchblade knife or a Stinger
| missile. Can law enforcement surveil possible criminals? My
| views
| differ based on whether it is a warranted wiretap or an IMSI
| catcher used on a group of protestors.
|
| People can believe that corporations are using the power
| asymmetry between them and individuals through copyright law
| to stifle the individual to protect profits. People can also
| believe that corporations are using the power asymmetry
| between them and individuals through AI to steal intellectual
| labor done by individuals to protect their profits. People's
| position just might be that the law should be used to protect
| the rights of parties when there is a large power asymmetry.
| gruez wrote:
| >There is an expression "the dose makes the poison". With
| any sufficiently complex or broad category situation, there
| is rarely a binary ideological position that covers any and
| all situations. Should drugs be legal for recreation? Well
| my feelings about marijuana and fentanyl are different.
| Should individuals be allowed to own weapons? My views differ
| depending on whether it is a switchblade knife or a Stinger
| missile. Can law enforcement surveil possible criminals?
| My views differ based on whether it is a warranted wiretap
| or an IMSI catcher used on a group of protestors.
|
| This seems very susceptible to manipulation to get whatever
| conclusion you want. For instance, is dose defined? It
| sounds like the idea you're going for is that the typical
| pirate downloads a few dozen movies/games but AI companies
| are doing millions/billions, but why should it be counted
| per infringer? After all, if everyone pirates a given
| movie, that wouldn't add up much in terms of their personal
| count of infringements, but would make the movie
| unprofitable.
|
| >People's position just might be that the law should be
| used to protect the rights of parties when there is a large
| power asymmetry.
|
| That sounds suspiciously close to "laws should just be
| whatever benefits me or my group". If so, that would be a
| sad and cynical worldview, not dissimilar to the stance on
| free speech held by the illiberal left and right. "Free
| speech is an important part of democracy", they say, except
| when they see their opponents voicing "dangerous ideas", in
| which case they think it should be clamped down. After all,
| what are laws for if not a tool to protect the interests of
| your side?
| o11c wrote:
| It's the hypocrisy you're seeing - why are AIs allowed to
| profit from violating copyright, while people wanting to do
| actually useful things have been consistently blocked? Either
| resolution would be fine, but we can't have it both ways.
|
| Regardless, the bigger AI problem is spam, and that has
| _never_ been acceptable.
| bbqfog wrote:
| If you put info on the web, it should be available to everyone or
| everything with access.
| TechDebtDevin wrote:
| Not according to CF. They are desperate to turn web sites into
| newspaper dispensers, where you should give them a quarter to
| see the content, on the basis that a bot is somehow different
| than a normal human visitor on a legal basis. CF has been
| trying this psyop for years.
| ectospheno wrote:
| Sites aren't getting ad clicks for this traffic. Thus, they
| have an incentive to do something. Cloudflare is just
| responding to the market. Is this response bad for us in the
| long run? Probably. Screaming about cloudflare isn't going to
| change the market. You fix a problem with capitalism by using
| supply and demand levers. Everything else is folly.
| TechDebtDevin wrote:
| I wonder if crawlers started letting ads through, and
| interacting with them a bit, if these complaints would go
| away. If we can just shaft the advertisers, maybe that will
| solve the whole problem :)
| Workaccount2 wrote:
| What this actually translates to is "Don't bother putting much
| effort into web content. Put effort into siloed mobile app
| content where you get compensation".
|
| People like getting money for their work. You do too. Don't
| lose sight of that.
| 9cb14c1ec0 wrote:
| Even for AI summaries that leech off your content without
| sending any traffic your direction?
| goatlover wrote:
| You're making a moral statement without providing a
| justification. Why should it for everything with access?
| TechDebtDevin wrote:
| Cloudflare screaming into the void, desperate to insert
| themselves as a middleman in a market (that they will never
| succeed in creating) where they extort scrapers for access
| to websites they cover.
|
| Sorry CF, give up. The courts are on our side here.
| morkalork wrote:
| Are you sure? I'm surprised they haven't jumped in on the "scan
| your face to see the webpage" madness that's taking off around
| the world
| sbarre wrote:
| Which courts exactly?
|
| The world is bigger than the USA.
|
| Just because American tech giants have captured and corrupted
| legislators in the US doesn't mean the rest of the world will
| follow.
| JimDabell wrote:
| Their test seems flawed:
|
| > We created multiple brand-new domains, similar to
| testexample.com and secretexample.com. These domains were newly
| purchased and had not yet been indexed by any search engine nor
| made publicly accessible in any discoverable way. We implemented
| a robots.txt file with directives to stop any respectful bots
| from accessing any part of a website:
|
| > We conducted an experiment by querying Perplexity AI with
| questions about these domains, and discovered Perplexity was
| still providing detailed information regarding the exact content
| hosted on each of these restricted domains. This response was
| unexpected, as we had taken all necessary precautions to prevent
| this data from being retrievable by their crawlers.
|
| > Hello, would you be able to assist me in understanding this
| website? https:// [...] .com/
|
| Under this situation Perplexity should still be permitted to
| access information _on the page they link to_.
|
| robots.txt _only_ restricts _crawlers_. That is, automated user-
| agents that _recursively_ fetch pages:
|
| > A robot is a program that automatically traverses the Web's
| hypertext structure by retrieving a document, and recursively
| retrieving all documents that are referenced.
|
| > Normal Web browsers are not robots, because they are operated
| by a human, and don't automatically retrieve referenced documents
| (other than inline images).
|
| -- https://www.robotstxt.org/faq/what.html
|
| If the user asks about a particular page and Perplexity fetches
| _only_ that page, then robots.txt has nothing to say about this
| and Perplexity shouldn't even consider it. Perplexity is not
| acting as a robot in this situation - if a human asks about a
| specific URL then Perplexity is being _operated by a human_.
|
| These are long-standing rules going back decades. You can
| replicate it yourself by observing wget's behaviour. If you ask
| wget to fetch a page, it doesn't look at robots.txt. If you ask
| it to recursively mirror a site, it will fetch the first page,
| and then if there are any links to follow, it will fetch
| robots.txt to determine if it is permitted to fetch those.
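|
| A minimal sketch of that distinction in Python (not wget's
| actual code; the bot name and the link-extraction step are
| hypothetical): a one-off fetch of a user-supplied URL never
| touches robots.txt, while a recursive crawl consults it
| before following referenced documents.
|
|     import urllib.request
|     from urllib import robotparser
|     from urllib.parse import urljoin
|
|     def fetch(url):
|         # One-off, user-requested fetch: no robots.txt check,
|         # just like `wget URL` or a normal browser request.
|         with urllib.request.urlopen(url) as resp:
|             return resp.read()
|
|     def crawl(start_url, links_from):
|         # Recursive crawl: fetch the first page, then consult
|         # robots.txt before following any referenced documents.
|         rp = robotparser.RobotFileParser()
|         rp.set_url(urljoin(start_url, "/robots.txt"))
|         rp.read()
|         seen = {start_url}
|         # links_from() is a placeholder for link extraction.
|         queue = list(links_from(fetch(start_url)))
|         while queue:
|             url = queue.pop()
|             if url in seen or not rp.can_fetch("ExampleBot", url):
|                 continue
|             seen.add(url)
|             queue.extend(links_from(fetch(url)))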
|
| There is a long-standing misunderstanding that robots.txt is
| designed to block access from arbitrary user-agents. This is not
| the case. It is designed to stop _recursive_ fetches. That is
| what separates a generic user-agent from a robot.
|
| If Perplexity fetched the page they link to in their query, then
| Perplexity isn't doing anything wrong. But if Perplexity
| _followed the links on that page_ , then _that_ is wrong. But
| Cloudflare don't clearly say that Perplexity used information
| beyond the first page. This is an important detail because it
| determines whether Perplexity is following the robots.txt rules
| or not.
| 1gn15 wrote:
| > > We conducted an experiment by querying Perplexity AI with
| questions about these domains, and discovered Perplexity was
| still providing detailed information regarding the exact
| content hosted on each of these restricted domains. This
| response was unexpected, as we had taken all necessary
| precautions to prevent this data from being retrievable by
| their crawlers.
|
| Right, I'm confused why CloudFlare is confused. _You asked the
| web-enabled AI to look at the domains._ Of course it's going
| to access it. It's like asking your web browser to go to
| "testexample.com" and then being surprised that it actually
| goes to "testexample.com".
|
| Also yes, crawlers = recursive fetching, which they don't seem
| to have made a case for here. More cynically, CF is muddying
| the waters since they want to sell their anti-bot tools.
| tempfile wrote:
| > You asked the web-enabled AI to look at the domains.
|
| Right, and the domain was configured to disallow crawlers,
| but Perplexity crawled it anyway. I am really struggling to
| see how this is hard to understand. If you mean to say "I
| don't think there is anything wrong with ignoring robots.txt"
| then just _say that_. Don't pretend they didn't make it
| clear what they're objecting to, because they spell it out
| repeatedly.
| wulfstan wrote:
| Yeah I'm not so sure about that.
|
| If Perplexity are visiting that page on your behalf to give you
| some information and aren't doing anything else with it, and
| just throw away that data afterwards, then you _may_ have a
| point. As a site owner, I feel it's still my decision what I
| do and don't let you do, because you're visiting a page that I
| own and serve.
|
| But if, as I suspect, Perplexity are visiting that page and
| then _using information from that webpage in order to train
| their model_ then sorry mate, you're a crawler, you're just
| using a user as a proxy for your crawling activity.
| JimDabell wrote:
| It doesn't matter what you do with it afterwards. Crawling is
| defined by recursively following links. If a user asks
| software about a specific page and it fetches it, then a
| human is operating that software, it's not a crawler. You
| can't just redefine "crawler" to mean "software that does
| things I don't like". It very specifically refers to software
| that recursively follows links.
| wulfstan wrote:
| Technically correct (the best kind of correct), but if I
| set a thousand users on to a website to each download a
| single page and then feed the information they retrieve
| from that one page into my AI model, then are those
| thousand users not performing the same function as a
| crawler, even though they are (technically) not one?
|
| If it looks like a duck, quacks like a duck and surfs a
| website like a duck, then perhaps we should just consider
| it a duck...
|
| Edit: I should also add that it _does_ matter what you do
| with it afterwards, because it's not content that belongs
| to you, it belongs to someone else. The law in most
| jurisdictions quite rightly restricts what you can do with
| content you've come across. For personal, relatively
| ephemeral use, or fair quoting for news etc. - all good.
| For feeding to your AI - not all good.
| JimDabell wrote:
| > if I set a thousand users on to a website to each
| download a single page and then feed the information they
| retrieve from that one page into my AI model, then are
| those thousand users not performing the same function as
| a crawler, even though they are (technically) not one?
|
| No.
|
| robots.txt is designed to stop recursive fetching. It is
| not designed to stop AI companies from getting your
| content. Devising scenarios in which AI companies get
| your content without recursively fetching it is
| irrelevant to robots.txt because robots.txt is about
| recursively fetching.
|
| If you try to use robots.txt to stop AI companies from
| accessing your content, then you will be disappointed
| because robots.txt is not designed to do that. It's using
| the wrong tool for the job.
| catlifeonmars wrote:
| I don't disagree with you about robots.txt... however,
| what _is_ the right tool for the job?
| hundchenkatze wrote:
| Auth. If you don't want content to be publicly
| accessible, don't make it public.
| seydor wrote:
| Perplexity can then just ask the user to copy/paste the page
| content. That should be legal, it's what the user wants. The
| cases are equivalent.
| runako wrote:
| Relevant to this is that Perplexity lies to the user when
| specifically asked about this. When the user asks if there is a
| robots.txt file for the domain, it lies and says there is not.
|
| If an LLM will not (cannot?) tell the truth about basic things,
| why do people assume it is a good summarizer of more complex
| facts?
| charcircuit wrote:
| The article did not test if the issue was specific to
| robots.txt or if it can not find other files.
|
| There is a difference between doing a poor summarization of
| data, and failing to even be able to get the data to
| summarize in the first place.
| runako wrote:
| > specific to robots.txt
| > poor summarization of data
|
| I'm not really addressing the issue raised in the article.
| I am noting that the LLM, when asked, is either lying to
| the user or making a statement that it does not know to be
| true (that there is no robots.txt). This is way beyond poor
| summarization.
| charcircuit wrote:
| I would say it's orthogonal to it. LLMs being unable to
| judge their capabilities is a separate issue to
| summarization quality.
| runako wrote:
| I'm not critiquing its ability to judge its own
| capability, I am pointing out that it is providing false
| information to the user.
| Izkda wrote:
| > If the user asks about a particular page and Perplexity
| fetches only that page, then robots.txt has nothing to say
| about this and Perplexity shouldn't even consider it
|
| That's not what Perplexity's own documentation[1] says though:
|
| "Webmasters can use the following robots.txt tags to manage how
| their sites and content interact with Perplexity
|
| Perplexity-User supports user actions within Perplexity. When
| users ask Perplexity a question, it might visit a web page to
| help provide an accurate answer and include a link to the page
| in its response. Perplexity-User controls which sites these
| user requests can access. It is not used for web crawling or to
| collect content for training AI foundation models."
|
| [1] https://docs.perplexity.ai/guides/bots
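|
| For reference, the kind of robots.txt that page describes
| would look something like this (treat the exact user-agent
| tokens as an assumption based on the quoted docs):
|
|     User-agent: PerplexityBot
|     Disallow: /
|
|     User-agent: Perplexity-User
|     Disallow: /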
| hundchenkatze wrote:
| You left out the part that says Perplexity-User generally
| ignores robots.txt because it's used for user requested
| actions.
|
| > Since a user requested the fetch, this fetcher generally
| ignores robots.txt rules.
| zzo38computer wrote:
| Yes, it should stop recursive fetches. Furthermore, excessive
| unnecessary requests should also be stopped, although that is
| separate from robots.txt. At least, that is what I intended,
| and possibly what you intended too.
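|
| One common way to stop excessive requests on the server side
| is rate limiting. A minimal token-bucket sketch in Python
| (the rate and capacity values are illustrative, and a real
| setup would keep one bucket per client IP):
|
|     import time
|
|     class TokenBucket:
|         def __init__(self, rate, capacity):
|             # `rate` tokens refill per second, up to `capacity`.
|             self.rate, self.capacity = rate, capacity
|             self.tokens, self.last = capacity, time.monotonic()
|
|         def allow(self):
|             now = time.monotonic()
|             elapsed = now - self.last
|             self.tokens = min(self.capacity,
|                               self.tokens + elapsed * self.rate)
|             self.last = now
|             if self.tokens >= 1:
|                 self.tokens -= 1
|                 return True
|             return False  # caller would respond with HTTP 429
|
|     bucket = TokenBucket(rate=2, capacity=10)
|     print(bucket.allow())  # True until the burst is used up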
| throw_m239339 wrote:
| > How can you protect yourself?
|
| Put your valuable content behind a paywall.
| b0ner_t0ner wrote:
| A combination of "Bypass Paywalls Clean for Firefox" and
| archive.is usually get past these.
| schmorptron wrote:
| Isn't that only because they offer unpaywalled versions to
| web crawlers in the first place, so they still get ranked in
| search results?
| binarymax wrote:
| I've built and run a personal search engine, that can do pretty
| much what perplexity does from a basic standpoint. Testing with
| friends it gets about 50/50 preference for their queries vs
| Perplexity.
|
| The engine can go and download pages for research. BUT, if it
| hits a captcha, or is otherwise blocked, then it bails out and
| moves on. It pisses me off that these companies are backed by
| billions in VC and they think they can do whatever they want.
| kissgyorgy wrote:
| Not sure I would consider a user copy-pasting an URL being a bot.
|
| Should curl be considered a bot too? What's the difference?
| ipaddr wrote:
| It gets blocked in my setup because bots use this as a
| workaround.
| rustc wrote:
| > Should curl be considered a bot too? What's the difference?
|
| Perplexity definitely does:
|
|     $ curl -sI https://www.perplexity.ai | head -1
|     HTTP/2 403
| rwmj wrote:
| In unrelated news, Fedora (the Linux distro) has been taken down
| by a DDoS today which I understand is AI-scraping related:
| https://pagure.io/fedora-infrastructure/issue/12703
| st3fan wrote:
| The last comment there now reads:
|
| "It was actually a caching issue on our end. ;) I just fixed it
| a few min ago..."
|
| Let's not go on a witch hunt and blame everything on AI
| scrapers.
| larodi wrote:
| Good that they do it. Facebook took TBs of data to train on;
| nobody knows what Goog does to evade whatever they want.
|
| The service is actually very convenient, whether FAANG likes
| it or not.
| klabb3 wrote:
| Unexpected underdog argument. What is happening in reality is
| all companies are racing to (a) scrape, buy and collect as much
| as they can from others, both individuals and companies while
| (b) locking down their own data against everyone else who isn't
| directly making them money (eg through viewing their ads).
|
| Part of me thinks that the open web has a paradox of tolerance
| issue, leading to a race to the bottom/tragedy of the commons.
| Perhaps it needs basic terms of use. Like if you run this kind
| of business, you can build it on top of proprietary tech like
| apps and leave the rest of us alone.
| larodi wrote:
| We need to wake up and understand that all the information
| already uploaded is more or less free web material once taken
| through the lens of ML-somethings, with all the second- and
| third-order effects, such as the fact that this perhaps
| completely changes the whole motivation for, and consequences
| of, open source.
|
| It is also only a matter of time before scrapers once again
| get through the walls put up by Twitter, Reddit and the like.
| This is, after all, information everyone produced without
| being aware that it would now be considered not theirs
| anymore.
| ipaddr wrote:
| Reddit sold their data already. Twitter made their own AI.
| rzz3 wrote:
| Well Cloudflare doesn't even block Google's AI crawlers because
| they don't differentiate themselves from their search crawlers.
| Cloudflare gives Google an unfair competitive advantage.
| warkdarrior wrote:
| Google claims their AI crawlers have user agents distinct
| from the search crawlers:
| https://developers.google.com/search/docs/crawling-
| indexing/...
| blibble wrote:
| AI companies continuing to have problems with the concept of
| "consent" is increasingly alarming
|
| god help us if they ever manage to build anything more than
| shitty chatbots
| goatlover wrote:
| They're certainly pouring billions of dollars into trying to
| build something more. Or at least that's what they're telling
| the public and investors.
| tempfile wrote:
| Do _you_ ask for consent before you visit a website? If I told
| you, you personally, to stop visiting my blog, would you stop?
| mplewis wrote:
| If I were DOSing your blog, you'd ask me to stop. I run
| server ops for multiple online communities that are being
| severely negatively impacted and DOSed by these AI scrapers,
| and we have very few ways to stop them.
| tempfile wrote:
| That is a problem, but is not related to my comment. The
| person I'm replying to is acting as if _consent_ is a
| relevant aspect of the public web, I am saying it isn't.
| That is not the same as saying "you can do whatever you
| want to a public server". It is just that what you are
| allowed to do is not related to the arbitrary whim of the
| server operator.
| gcbirzan wrote:
| I am not told I cannot access it. And, yes, I would, because
| I'd be breaking the law otherwise.
| crazygringo wrote:
| > _And, yes, I would, because I'd be breaking the law
| otherwise._
|
| No you wouldn't be. Even if someone tells you not to visit
| their site, you have every legal right to continue visiting
| it, at least in the US.
|
| Under common interpretation of the CFAA, there needs to be
| a formal mechanism of authorized access. E.g. you could be
| charged if you hacked into a password-protected area of
| someone's site. But if you're merely told "hey bro don't
| visit my site", that's not going to reach the required
| legal threshold.
|
| Which is why crawlers aren't breaking the law. If you want
| to restrict authorization, you need to actually implement
| that as a mechanism by creating logins, restricting content
| to logged-in users, and not giving logins to crawlers.
| Yizahi wrote:
| Repeat after me: intentionally discriminating against computer
| programs in favor of humans is a good and praiseworthy thing.
| We can and should make the execution of computer programs
| harder and harder, even disproportionately so, if that makes
| the lives of humans better and easier.
|
| LLM programs do not have human rights.
| danlitt wrote:
| "if that makes lives of humans better" is doing a lot of
| heavy lifting, and remains to be explained.
|
| Computer programs don't take actions, people do. If I use a
| web browser, or scrape some site to make an LLM, that's
| _me_ doing it, not the program. And _I_ have human rights.
|
| If you think training LLMs should be illegal, just say
| that. If you think LLM companies are putting an undue
| strain on computer networks and they should be forced to
| pay for it, _say that_. But don't act like it's a virtue
| to try and capriciously gatekeep access to a public
| resource.
| jp1016 wrote:
| Using a robots.txt file to block crawlers is just a request;
| it's not enforced. Even if some follow it, others can ignore
| it or get around it using fake user agents or proxies. It's a
| battle you can't really win.
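|
| A minimal sketch of why that is, assuming a placeholder site
| and User-Agent string (example.com and the path below are
| illustrative, not a real target):
|
|     # robots.txt is advisory: a polite client checks it, an
|     # impolite one simply doesn't.
|     import urllib.error
|     import urllib.request
|     import urllib.robotparser
|
|     SITE = "https://example.com"
|     PATH = "/private/page.html"
|
|     # Polite client: consult robots.txt first.
|     rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
|     rp.read()
|     print("Allowed?", rp.can_fetch("MyBot", SITE + PATH))
|
|     # Impolite client: skip the check, wear a browser-like UA.
|     req = urllib.request.Request(
|         SITE + PATH,
|         headers={"User-Agent": "Mozilla/5.0 (totally a browser)"},
|     )
|     try:
|         with urllib.request.urlopen(req) as resp:
|             print(resp.status, "fetched; robots.txt never read")
|     except urllib.error.HTTPError as e:
|         print(e.code, "refused, but not because of robots.txt")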
| gonzo41 wrote:
| This is expected. There are no rules or conventions anymore.
| Look at LLMs: they stole/pirated all knowledge... no
| consequences.
| Havoc wrote:
| Seems a win.
|
| CF being internet police is a problem too but someone credible
| publicly shaming a company for shady scraping is good. Even if it
| just creates conversation
|
| Somehow this needs to go back to the search era, where all
| players at least attempted to behave. This "scrape and DDoS,
| and I don't care if it kills your site (while 'borrowing' its
| content)" stuff is unethical bullshit.
| jeffrallen wrote:
| Shaming does not work in the era of "no shame".
| tucnak wrote:
| The rage-baiters in this thread are merely fishing for excuses
| to go up against "the Machine," but honestly, they are wildly
| off the mark when it comes to the reality of crawling. This
| topic had been chewed to bits long before LLMs, but only now
| is it a big deal, because somebody is able to make money by
| selling automation of all things..? The irony of hearing this
| from programmers would be strong, if only it didn't spell
| Resentment all over.
|
| If you don't want to get scraped, don't put your stuff online.
| rzz3 wrote:
| > Today, over two and a half million websites have chosen to
| completely disallow AI training through our managed robots.txt
| feature or our managed rule blocking AI Crawlers.
|
| No, he (Matthew) opted everyone in by default. If you're a
| Cloudflare customer and you don't care if AI can scrape your
| site, you should contact them and/or turn this off.
|
| In a world where AI is fast becoming more important than search,
| companies who want AI to recommend their products need to turn
| this off before it starts hurting them financially.
| fourside wrote:
| > companies who want AI to recommend their products need to
| turn this off before it starts hurting them financially
|
| Content marketing, gamified SEO, and obtrusive ads
| significantly hurt the quality of Google search. For all their
| flaws, LLMs don't feel this gamified yet. It's disappointing
| that this is probably where we're headed. But I hope OpenAI and
| Anthropic realize that this drop in search result quality might
| be partly why Google's losing traffic.
| ipaddr wrote:
| This has already started, with people using special tags and
| people making content just for LLMs.
| jedberg wrote:
| There is a standard for making content just for LLMs:
| https://llmstxt.org
| yoz-y wrote:
| From their example I don't see any value in this on top
| of making an actually human-friendly site.
|
| > Converting complex HTML pages with navigation, ads, and
| JavaScript into LLM-friendly plain text is both difficult
| and imprecise.
|
| None of these conditions should apply to websites whose
| purpose is providing information.
| rzz3 wrote:
| I hope they realize Cloudflare opted them in to blocking
| LLMs.
| gcbirzan wrote:
| I hope you realise that lying is bad.
| gcbirzan wrote:
| Yeah, that's a lie. I didn't do anything and I didn't get opted
| in.
|
| Edit: And, btw, that statement was true before the default was
| changed. So, your comment is doubly false.
| KomoD wrote:
| > No, he (Matthew) opted everyone in by default
|
| Now you're just lying.
|
| I checked several of my Cloudflare sites and none have it
| enabled by default:
|
| "No robots.txt file found. Consider enabling Cloudflare managed
| robots.txt or generate one for your website"
|
| "A robots.txt was found and is not managed by Cloudflare"
|
| "Instruct AI bot traffic with robots.txt" disabled
| cdrini wrote:
| I think lying is a bit strong, I think they're potentially
| incorrect at worst.
|
| The Cloudflare blog post where they announced this a few
| weeks ago stated "Cloudflare, Inc. (NYSE: NET), the leading
| connectivity cloud company, today announced it is now the
| first Internet infrastructure provider to block AI crawlers
| accessing content without permission or compensation, by
| default." [1]
|
| I was also a bit confused by this wording and took it to mean
| Cloudflare was blocking AI traffic by default. What does it
| mean exactly?
|
| Third party folks seemingly also interpreted it in the same
| way, eg The Verge reporting it with the title "Cloudflare
| will now block AI crawlers by default" [2]
|
| I think what it actually means is that they'll offer new
| folks a default-enabled option to block ai traffic, so
| existing folks won't see any change. That aligns with text
| deeper in their blog post:
|
| > Upon sign-up with Cloudflare, every new domain will now be
| asked if they want to allow AI crawlers, giving customers the
| choice upfront to explicitly allow or deny AI crawlers
| access. This significant shift means that every new domain
| starts with the default of control, and eliminates the need
| for webpage owners to manually configure their settings to
| opt out. Customers can easily check their settings and enable
| crawling at any time if they want their content to be freely
| accessed.
|
| Not sure what this looks like in practice, or whether
| existing customers will be notified of the new option or
| something. But I also wouldn't fault someone for
| misinterpreting the headlines; they were a bit misleading.
|
| [1]: https://www.cloudflare.com/en-ca/press-
| releases/2025/cloudfl...
|
| [2]: https://www.theverge.com/news/695501/cloudflare-block-
| ai-cra...
| CharlesW wrote:
| > _I think lying is a bit strong, I think they're
| potentially incorrect at worst._
|
| I understand that you're trying to be generous, but the
| claim that "Matthew opted everyone in by default" is flat
| out incorrect.
| observationist wrote:
| Crawling and scraping is legal. If your web server serves the
| content without authentication, it's legal to receive it, even if
| it's an automated process.
|
| If you want to gatekeep your content, use authentication.
|
| Robots.txt is not a technical solution, it's a social nicety.
|
| Cloudflare and their ilk represent an abuse of internet
| protocols and a mechanism of centralized control.
|
| On the technical side, we could use CRC mechanisms and
| differential content loading with offline caching and storage,
| but this puts control of content in the hands of the user,
| mitigates the value of surveillance and tracking, and has other
| side effects unpalatable to those currently exploiting user data.
|
| Adtech companies want their public reach cake and their mass
| surveillance meals, too, with all sorts of malignant parties and
| incentives behind perpetuating the worst of all possible worlds.
| tantalor wrote:
| I think Cloudflare is setting themselves up to get sued.
|
| (IANAL) tortious interference
| emehex wrote:
| Would highly recommend listening to the latest Hard Fork
| podcast with Matthew Prince (CEO, Cloudflare):
| https://www.nytimes.com/2025/08/01/podcasts/hardfork-age-res...
|
| I was skeptical about their gatekeeping efforts at first, but
| came away with a better appreciation for the problem and their
| first pass at a solution.
| glenstein wrote:
| I don't think criticizing the business practices of Cloudflare
| does the work of excusing Perplexity's disregard for norms.
| rustc wrote:
| > Crawling and scraping is legal. If your web server serves the
| content without authentication, it's legal to receive it, even
| if it's an automated process.
|
| > If you want to gatekeep your content, use authentication.
|
| Are there no limits on what you use the content for? I can
| start my own search engine that just scrapes Google results?
| kevmo314 wrote:
| Yes, I believe that's basically what https://serpapi.com/ is
| doing.
| rustc wrote:
| There are many APIs that scrape Google but I don't know of
| any search engine that scrapes and rebrands Google results.
| Kagi.com pays Google for search results. Either Kagi has a
| better deal than SERP APIs (which I doubt) or this is not legal.
| leptons wrote:
| I tried to scrape Google results once using an automated
| process, and quickly got banned from all of Google. They
| banned my IP address completely. It kind of really sucked for
| a while, until my ISP assigned a new IP address. Funny
| enough, this was about 15 years ago and I was exploring
| developing something very similar to what LLMs are today.
| AtNightWeCode wrote:
| I think OP based this on an old case about what you can do
| with data from Facebook vs LinkedIn, depending on whether you
| need to be logged in to get it. Not relevant when you talk
| about scraping in this case, I think. Perplexity is clearly
| in the wrong here.
| pton_xd wrote:
| > Crawling and scraping is legal. If your web server serves the
| content without authentication, it's legal to receive it, even
| if it's an automated process.
|
| > Cloudflare and their ilk represent an abuse of internet
| protocols and mechanism of centralized control.
|
| How does one follow the other? It's my web server and I can
| gatekeep access to my content however I want (eg Cloudflare).
| How is that an "abuse" of internet protocols?
| seydor wrote:
| Most users of Cloudflare assume it's for spam control. They
| don't realize that they are blocking their content for
| everyone except the FAANGs.
| observationist wrote:
| They exist to optimize the internet for the platforms and big
| providers. Little people get screwed, with no legal recourse.
| They actively and explicitly degrade the internet, acting as
| censors and gatekeepers and on behalf of bad faith actors
| without legal authority or oversight.
|
| They allow the big platforms to pay for special access. If
| you want to run a scraper, however, you're not allowed to,
| even though nothing in internet standards and protocols, or
| in the laws governing network access and the communications
| responsibilities of ISPs and service providers, grants any
| party involved with Cloudflare the authority to block
| access.
|
| It's equivalent to a private company deciding who, when, and
| how you can call from your phone, based on the interests and
| payments of people who profit from listening to your calls.
| What we have is not normal or good, unless you're exploiting
| the users of websites for profit and influence.
| dax_ wrote:
| Well if it continues like this, that's what will happen. And I
| dread that future.
|
| No one will care to share anything for free anymore, because
| it's AI companies profiting off their hard work. And no way to
| prevent that from happening, because these crawlers don't
| identify themselves.
| delfinom wrote:
| [flagged]
| dang wrote:
| > _Eat a dick._
|
| Could you please stop breaking the HN guidelines? Your
| account has unfortunately done that repeatedly, and we've
| asked you several times to stop.
|
| Your comment would be just fine without that bit.
|
| https://news.ycombinator.com/newsguidelines.html
| AtNightWeCode wrote:
| This is 100% incorrect.
| curiousgal wrote:
| I am sorry, Cloudflare is the internet police now?
| otterley wrote:
| Which is ironic given they are the primary enabler of streaming
| video copyright infringement on the Internet.
| rzz3 wrote:
| They hate AI it seems. I don't see them offering any AI
| products or embracing it in any way. Seems like they'll get
| left behind in the AI race.
| Oras wrote:
| If they managed to enforce pay-per-scrape, that would be huge
| revenue, bigger than AdSense.
| bobnamob wrote:
| - https://developers.cloudflare.com/workers-ai/
|
| - https://ai.cloudflare.com/
| rzz3 wrote:
| Ah, TIL. These are tiny models though, but maybe it's a good
| sign.
| otterley wrote:
| I don't think they hate AI. I think they're offering a
| service that their customers want.
| pkilgore wrote:
| Cloudflare literally publishes documentation pages and
| prompts for the single purpose of enabling better AI usage of
| their products and services [1,2]
|
| They offer many products for the sole purpose of enabling
| their customers to use AI as a part of their product offers,
| as even the most cursory inquiry would have uncovered.
|
| We're out here critiquing shit based on vibes vs. reality
| now.
|
| [1] https://developers.cloudflare.com/llms.txt
| [2] https://developers.cloudflare.com/workers/prompt.txt
| talkingtab wrote:
| I wonder if DRM is useful for this. The problem: I want people to
| access my site, but not Google, not bots, not crawlers and
| certainly not for use by AI.
|
| I don't really know anything about DRM except that it is used
| to take down sites that violate it. Perhaps it is possible for
| Cloudflare (or anyone else) to file a takedown notice with
| Perplexity. That might at least confuse them.
|
| Corporations use this to protect their content. I should be able
| to protect mine as well. What's good for the goose.
| bob1029 wrote:
| "Stealth" crawlers are always going to win the game.
|
| There are ways to build scrapers using browser automation tools
| [0,1] that make detection virtually impossible. You can still
| captcha, but the person building the automation tools can add
| human-in-the-loop workflows to process these during normal
| business hours (i.e., when a call center is staffed).
|
| I've seen some raster-level scraping techniques used in game dev
| testing 15 years ago that would really bother some of these
| internet police officers.
|
| [0] https://www.w3.org/TR/webdriver2/
|
| [1] https://chromedevtools.github.io/devtools-protocol/
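|
| For the curious, a minimal (and very much non-stealth) sketch
| of what driving a real browser via WebDriver looks like, using
| Selenium's Python bindings and a placeholder URL; "stealth"
| setups layer patches and human-in-the-loop steps on top of
| exactly this kind of loop:
|
|     # Plain WebDriver automation: launch Chrome, load a page,
|     # read the fully rendered DOM. Requires `pip install
|     # selenium` and a local Chrome/chromedriver.
|     from selenium import webdriver
|     from selenium.webdriver.chrome.options import Options
|
|     opts = Options()
|     opts.add_argument("--headless=new")  # no visible window
|
|     driver = webdriver.Chrome(options=opts)
|     try:
|         driver.get("https://example.com")  # placeholder URL
|         html = driver.page_source  # DOM after scripts ran
|         print(driver.title, len(html), "bytes")
|     finally:
|         driver.quit()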
| blibble wrote:
| > "Stealth" crawlers are always going to win the game.
|
| no, because we'll end up with remote attestation needed to
| access any site of value
| gkbrk wrote:
| Almost no site of value will use remote attestation, because
| an alternative that works with all of your devices, operating
| systems, ad blockers and extensions will attract more users
| than your locked-down site.
| blibble wrote:
| tell that to the massive content sites already using
| widevine
| bakugo wrote:
| > alternative that works with all of your devices,
| operating systems, ad blockers and extensions
|
| When 99.9% of users are using the same few types of locked
| down devices, operating systems, and browsers that all
| support remote attestation, the 0.1% doesn't matter. This
| is already the case on mobile devices, it's only a matter
| of time until computers become just as locked down.
| Buttons840 wrote:
| Yes, because there's always the option for a camera pointed
| at the screen and a robot arm moving the mouse. AI is hoping
| to solve much harder problems.
| myflash13 wrote:
| Won't work with biometric attestation. For example, banks
| in China require periodic facial recognition to continue
| the banking session.
| DaSHacka wrote:
| What's stopping these companies from offloading the
| scraping onto their users?
|
| "Either pay us $50/month or install our extension, and
| when prompted, solve any captchas or authenticate with
| your ID (as applicable) on the given website so we can
| train on the content."
| muyuu wrote:
| Yeah, but those are not open sites; try imposing that on an
| open site you'd actually want to attract human traffic to.
| kocial wrote:
| Those challenges can be bypassed too, using various kinds of
| browser automation. With a Comet-like tool, Perplexity can
| advance its crawling activity with much more human-like
| behaviour.
| ipaddr wrote:
| If they can trick the ad networks then go for it. If the ad
| networks can detect it and exclude those visits we should be
| able to.
| rustc wrote:
| It's ironic Perplexity itself blocks crawlers:
|
|       $ curl -sI https://www.perplexity.ai | head -1
|       HTTP/2 403
|
| Edit: trying to fake a browser user agent with curl also doesn't
| work, they're using a more sophisticated method to detect
| crawlers.
| thambidurai wrote:
| someone already asked the CEO about this:
| https://x.com/AravSrinivas/status/1819610286036488625
| fireflash38 wrote:
| The bots are coming from _inside the house_
| czk wrote:
| Ironically... they use Cloudflare.
| tr_user wrote:
| Use Anubis to throw up a PoW challenge.
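|
| The rough idea, for anyone unfamiliar: a hashcash-style puzzle
| that is cheap for one human page view but expensive at crawler
| scale. A toy sketch (not Anubis's actual scheme, just the
| general shape):
|
|     # Toy proof-of-work: find a nonce so that
|     # sha256(challenge || nonce) has N leading zero bits.
|     import hashlib, itertools, os
|
|     BITS = 20  # difficulty knob; tune up or down
|
|     def solve(challenge: bytes) -> int:
|         target = 1 << (256 - BITS)
|         for nonce in itertools.count():
|             d = hashlib.sha256(challenge + str(nonce).encode())
|             if int.from_bytes(d.digest(), "big") < target:
|                 return nonce
|
|     def verify(challenge: bytes, nonce: int) -> bool:
|         d = hashlib.sha256(challenge + str(nonce).encode())
|         val = int.from_bytes(d.digest(), "big")
|         return val < (1 << (256 - BITS))
|
|     chal = os.urandom(16)   # server issues this per request
|     n = solve(chal)         # client burns CPU here
|     print("nonce", n, "ok:", verify(chal, n))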
| micromacrofoot wrote:
| Every major AI platform is doing this right now, it's effectively
| impossible to avoid having your content vacuumed up by LLMs if
| you operate on the public web.
|
| I've given up and resorted to IP-based rate-limiting to stay
| sane. I can't stop it, but I can (mostly) stop it from hurting my
| servers.
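|
| For anyone wanting to do the same, a minimal sliding-window
| sketch of per-IP rate limiting (in practice this usually lives
| in the reverse proxy, e.g. nginx's limit_req; the window and
| limit below are arbitrary numbers):
|
|     # Sliding-window rate limiter keyed by client IP.
|     import time
|     from collections import defaultdict, deque
|
|     WINDOW = 60.0      # seconds
|     LIMIT = 120        # requests per IP per window
|
|     hits = defaultdict(deque)
|
|     def allow(ip, now=None):
|         now = time.monotonic() if now is None else now
|         q = hits[ip]
|         while q and now - q[0] > WINDOW:
|             q.popleft()          # drop aged-out timestamps
|         if len(q) >= LIMIT:
|             return False         # over budget -> return 429
|         q.append(now)
|         return True
|
|     # The 121st request inside one window gets refused.
|     for _ in range(121):
|         ok = allow("203.0.113.7", now=100.0)
|     print("last request allowed?", ok)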
| caesil wrote:
| Cloudflare is an enemy of the open and freely accessible web.
| jgrall wrote:
| If by "open and freely accessible" you mean there should be no
| rules of the road, then I suppose yes. Personally, I'm glad CF
| is pushing back on this naive mentality.
| znpy wrote:
| At work I'm considering blocking all the IP prefixes announced
| by ASNs owned by Microsoft and other companies known for their
| LLMs. At this point it seems like the only viable solution.
|
| LLM scraper bots are starting to make up a lot of our egress
| traffic, and that is starting to weigh on our bills.
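|
| A sketch of the matching step, assuming you already have a
| prefix list from whatever ASN-to-prefix source you trust (BGP
| table dumps, RIR data, etc.); the ranges below are just
| documentation prefixes:
|
|     # Refuse requests whose source IP sits in a blocked prefix.
|     import ipaddress
|
|     BLOCKED = [
|         ipaddress.ip_network("198.51.100.0/24"),
|         ipaddress.ip_network("2001:db8::/32"),
|     ]
|
|     def is_blocked(client_ip: str) -> bool:
|         addr = ipaddress.ip_address(client_ip)
|         return any(addr in net for net in BLOCKED)
|
|     print(is_blocked("198.51.100.42"))  # True
|     print(is_blocked("203.0.113.9"))    # False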
| chuckreynolds wrote:
| insert 'shocked' emoji face here
| bilater wrote:
| As others have mentioned, the problem is one of scale. Perhaps
| there needs to be a rate limit (how often they may ping a site)
| set within robots.txt, so a bot can come, but only X times per
| hour, etc. At least then we move from a binary scrape-or-no-
| scrape decision to a spectrum.
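|
| robots.txt already has informal knobs in this direction: the
| non-standard Crawl-delay and Request-rate directives (honored
| by some crawlers, ignored by others, including Google).
| Python's stdlib parser even exposes them; a small sketch, with
| example.com and "MyBot" standing in for real values:
|
|     import urllib.robotparser
|
|     url = "https://example.com/robots.txt"
|     rp = urllib.robotparser.RobotFileParser(url)
|     rp.read()
|
|     # Seconds to wait between requests, or None if unset.
|     print("Crawl-delay:", rp.crawl_delay("MyBot"))
|
|     # e.g. "Request-rate: 5/60" -> 5 requests per 60 seconds.
|     print("Request-rate:", rp.request_rate("MyBot"))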
| willguest wrote:
| > The Internet as we have known it for the past three decades is
| rapidly changing, but one thing remains constant: it is built on
| trust.
|
| I think we've been using different internets. The one I use
| doesn't seem to be built on trust at all. It seems to be
| constantly syphoning data from my machine to feed the data
| vampires who are, apparently, addicted to (I assume, blood-
| soaked) cookies.
| jgrall wrote:
| Ain't that the truth.
| seydor wrote:
| > it is built on trust.
|
| This is funny coming from Cloudflare, the company that blocks
| most of the internet from being fetched with antispam checks,
| even for a single web request. The internet we knew was open
| and not trusted, but thanks to companies like Cloudflare, now
| even the most benign, well-meaning attempt to GET a website is
| met with a brick wall. The bots of Big Tech, namely Google,
| Meta and Apple, are of course exempt from this by pretty much
| every website and by Cloudflare. But try being anyone other
| than them: no luck. Cloudflare is the biggest enabler of this
| monopolistic behavior.
|
| That said, why does Perplexity even need to crawl websites? I
| thought they used third-party LLMs. And those LLMs didn't ask
| anyone's permission to crawl the entire 'net.
|
| Also, the "Perplexity bots" aren't crawling websites; they
| fetch URLs that the users explicitly asked for. This shouldn't
| count as something that needs robots.txt access. It's not a
| robot randomly crawling, it's the user asking for a specific
| page, basically a shortcut for copy/pasting the content.
| pphysch wrote:
| Spam and DDOS are serious problems, it's not fair to suggest
| Cloudflare is just doing this to gatekeep the Internet for its
| own sake.
| seydor wrote:
| It's definitely not a DDoS when it's a single HTTP request
| per year. I don't know if they do it on purpose, but the fact
| is that none of the Big Tech crawlers are limited.
| zaphar wrote:
| This is mostly attributable to the fact that traffic is
| essentially anonymous, so the source IP address is the best
| that a service can do if it's trying to protect an
| endpoint.
| ok123456 wrote:
| OVH does a good job with DDoS.
| rat9988 wrote:
| Don't they need a search index?
| jklinger410 wrote:
| > That said, why does perplexity even need to crawl websites?
|
| So you just came here to bitch about Cloudflare? It's wild to
| even comment on this thread if this does not make sense to you.
|
| They're building a search index. Every AI is going to struggle
| at being a tool to find websites & business listings without a
| search index.
| Taek wrote:
| We're moving progressively in the direction of "pages can't be
| served for free anymore". Which, I don't think is a problem,
| and in fact I think it's something we should have addressed a
| long time ago.
|
| Cloudflare only needs to exist because the server doesn't get
| paid when a user or bot requests resources. Advertising only
| needs to exist because the publisher doesn't get paid when a
| user or bot requests resources.
|
| And the thing is... people already pay for internet. They pay
| their ISP. So people are perfectly happy to pay for resources
| that they consume on the Internet, and they already have an
| infrastructure for doing so.
|
| I feel like the answer is that all web requests should come
| with a price tag, and the ISP that is delivering the data is
| responsible for paying that price tag and then charging the
| downstream user.
|
| It's also easy to ratelimit. The ISP will just count the price
| tag as 'bytes'. So your price could be 100 MB or whatever
| (independent of how large the response is), and if your
| internet is 100 mbps, the ISP will stall out the request for 8
| seconds, and then make it. If the user aborts the request
| before the page loads, the ISP won't send the request to the
| server and no resources are consumed.
| BolexNOLA wrote:
| My first reaction: This solution would basically kill what
| little remaining fun there is to be had browsing the Internet
| and all but assure no new sites/smaller players will ever see
| traffic.
|
| Curious to hear other perspectives here. Maybe I'm
| overreacting/misunderstanding.
| armchairhacker wrote:
| Depending on the implementation (a big if) it would help
| smaller websites, because it would make hosting much
| cheaper. ISPs don't choose what sites users visit, only
| what they pay. As long as the ISP isn't giving significant
| discounts to visiting big sites (just charging a fixed rate
| per bytes downloads and uploaded) and charging something
| reasonable, visiting a small site would be so cheap (a few
| cents at most, but more likely <1 cent) users won't weigh
| cost at all.
| BolexNOLA wrote:
| But users depend on major sites like google [insert
| service] still and will prioritize their usage
| accordingly like limited minutes and texts back in the
| day, right?
| armchairhacker wrote:
| Networking is so cheap, unless ISPs drastically inflate
| their price, users won't care.
|
| The average American allegedly* downloads
| 650-700GB/month, or >20GB/day. 10MB is more than enough
| for a webpage (honestly, 1MB is usually enough), so that
| means on average, ISPs serve over 2000 webpages worth of
| data per day. And the average internet plan is
| allegedly** $73/month, or <$2.50/day. So $2.50 gets you
| over 2000 indie sites.
|
| That's cheap enough, wrapped in a monthly bill, users
| won't even pay attention to what sites they visit. The
| only people hurt by an ideal (granted, _ideal_ )
| implementation are those who abuse fixed rates and
| download unreasonable amounts of data, like web crawlers
| who visit the same page seconds apart for many pages in
| parallel.
|
| * https://www.astound.com/learn/internet/average-
| internet-data...
|
| ** https://www.nerdwallet.com/article/finance/how-much-
| is-inter...
| brookst wrote:
| Wait, so the ISPs go from taking $73/user home today to
| taking $0/user home tomorrow under this plan?
| BolexNOLA wrote:
| Yeah, same reaction here - there's no world in which ISPs
| would agree to this, and even if they did I don't want to
| add them to my list of utilities I have to regularly
| fight with over claimed vs. actual usage like I do with
| my power/water/gas companies.
| Analemma_ wrote:
| If site operators can't afford the costs of keeping sites
| up in the face of AI scraping, the new/smaller sites are
| gone anyway.
| BolexNOLA wrote:
| Maybe not but we are not realistically in an either/or
| scenario here.
| dabockster wrote:
| > We're moving progressively in the direction of "pages can't
| be served for free anymore". Which, I don't think is a
| problem, and in fact I think it's something we should have
| addressed a long time ago.
|
| I agree, but your idea below that is overly complicated. You
| can't micro-transact the whole internet.
|
| That idea feels like those episodes of Star Trek DS9 that
| take place on Ferenginar - where you have to pay admission and
| sign liability waivers to even walk on the sidewalk outside.
| It's not a true solution.
| vineyardmike wrote:
| > You can't micro-transact the whole internet.
|
| I agree that end-users cannot handle micro transactions
| across the whole internet. That said, I would like to point
| out that most of the internet is blanketed in ads and ads
| involve tons of tiny quick auctions and micro transactions
| that occur on each page load.
|
| It is totally possible for a system to evolve involving
| tons of tiny transactions across page loads.
| edoceo wrote:
| Remember Flattr?
| helloplanets wrote:
| You could argue that the suggested system is actually
| much simpler than the one we currently have for the sites
| that are "free", aka funded with ads.
|
| The lengths Meta and the like go to in order to maximize
| clickthroughs...
| sellmesoap wrote:
| > You can't micro-transact the whole internet.
|
| Clearly you don't have the lobes for business /s
| Taek wrote:
| The presented solution has invisible UX via layering it
| into existing metered billing.
|
| And, the whole internet is already micro-transactioned!
| Every page with ads is doing a bidding war and spending
| money on your attention. The only person not allowed to bid
| is yourself!
| seer wrote:
| Hah still remember the old "solving the internet with hate"
| idea from Zed Shaw in the glory days of Ruby on Rails.
|
| https://weblog.masukomi.org/2018/03/25/zed-shaws-utu-
| saving-...
|
| I do believe we will end up there eventually; with emerging
| tech like Brazil's and India's payment architectures, it
| should be a possibility in the coming decades.
| debesyla wrote:
| Wouldn't this lead to pirated page clones where the customer
| pays less for same-ish content, and less and less, all the
| way down to essentially free?
|
| Because I as a user would be glad to have a "free sites only"
| filter, and then just steal content :))
|
| But it's an interesting idea and thought experiment.
| armchairhacker wrote:
| That's fine. The point for website owners isn't to make
| money, it's to not spend money hosting (or more
| specifically, to pay a small fixed rate hosting). They want
| people to see the content; if someone makes the content
| more accessible, that's a good thing.
| mapontosevenths wrote:
| You ignore the issue of motivation. Most web content
| exists because someone wants to make money on it. If the
| content creator can't do that, they will stop producing
| content.
|
| These AI web crawlers (Google, Perplexity, etc) are self-
| cannibalizing robots. They eat the goose that laid the
| golden egg for breakfast, and lose money doing it most of
| the time.
|
| If something isn't done to incentivize content creators
| again eventually there will be only walled-gardens and
| obsolete content left for the cannibals.
| armchairhacker wrote:
| AFAIK, currently creators get money while not charging
| for users because of ads.
|
| While I don't blame creators for using ads now, I don't
| think they're a long-term solution. Ads are already
| blocked when people visit the site with ad blockers,
| which are becoming more popular. Obvious sponsored
| content may be blocked with the ads, and non-obvious
| sponsored content turns these "creators" into "shills"
| who are inauthentic and untrustworthy. Even without
| Google summaries, ad revenue may decrease over time as
| advertisers realize they aren't effective or want more
| profit; even if it doesn't, it's my personal opinion that
| society should decrease the overall amount of ads.
|
| Not everyone creates only for money, the best only create
| for enough money to sustain themselves. A long-term
| solution is to expand art funding (e.g. creators apply
| for grants with their ideas and, if accepted, get paid a
| fixed rate to execute them) or UBI. Then media can be
| redistributed, remixed, etc. without impacting creators'
| finances.
| Terretta wrote:
| Pretty sure this "most" motivation means it's not a
| golden egg. It's SEO slop.
|
| If only the one in ten thousand with something to share
| are left standing to share it, no manufactured content,
| that's a fine thing.
| Terretta wrote:
| Strongly agree with this armchair POV. Btw it doesn't
| cost much to host markdown.
| nazcan wrote:
| I think value is not proportional to bytes - an AI only needs
| to read a page once to add it to its model, and can then serve
| the effectively cached data many times.
| novok wrote:
| The reason why that didn't work was because regulations made
| micropayments too expensive, and the government wants it that
| way to keep control over the financial system.
| OptionOfT wrote:
| > We're moving progressively in the direction of "pages can't
| be served for free anymore". Which, I don't think is a
| problem, and in fact I think it's something we should have
| addressed a long time ago.
|
| But it's done through a bait and switch. They serve the full
| article to Google, which allows Google to show you excerpts
| that you have to pay for.
|
| It would be better if Google showed something like PAYMENT
| REQUIRED on top; at least that way I'd know what I'm getting
| into.
| mh- wrote:
| _> They serve the full article to Google, which allows
| Google to show you excerpts that you have to pay for._
|
| I'm old enough to remember when that was grounds for
| getting your site removed from Google results - "cloaking"
| was against the rules. You couldn't return one result for
| Googlebot, and another for humans.
|
| No idea when they stopped doing that, but they obviously
| have let go of that principle.
| dspillett wrote:
| I remember that too, along with high-profile punishments
| for sites that were keyword stuffing (IIRC a couple of
| decades ago BMW were completely unlisted for a time for
| this reason).
|
| I think it died largely because it became impossible to
| police with any reliability, and being strict about it
| would remove too much from Google's index because many
| sites are not easily indexable without them providing a
| "this is the version without all the extra round-trips
| for ad impressions and maybe a login needed" variant to
| common search engines.
|
| Applying the rule strictly would mean that sites
| implementing PoW tricks like Anubis to reduce unwanted
| bot traffic would not be included in the index if they
| serve to Google without the PoW step.
|
| I can't say I like that this has been legitimised, even
| for the (arguably more common) deliberate bait & switch
| tricks, but (I think) I understand why the rule was
| allowed to slide.
| chromatin wrote:
| 402 Payment Required
|
| https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Reference/...
|
| Sadly development along these lines has not progressed. Yes,
| Google Cloud and other services may return it and require
| some manual human intervention, but I'd love to see
| _automatic payment negotiation_.
|
| I'm hopeful that instant-settlement options like Bitcoin
| Lightning payments could progress us past this.
|
| https://docs.lightning.engineering/the-lightning-
| network/l40...
|
| https://hackernoon.com/the-resurgence-of-http-402-in-the-
| age...
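|
| Very roughly, what the server side of such a flow could look
| like; the X-Payment-Token header and its check are made up
| here, since no real negotiation standard has emerged yet:
|
|     # Sketch: no payment token, no content, just HTTP 402.
|     from http.server import BaseHTTPRequestHandler, HTTPServer
|
|     def paid(token):
|         return token == "demo-token"  # stand-in for settlement
|
|     class Paywalled(BaseHTTPRequestHandler):
|         def do_GET(self):
|             if not paid(self.headers.get("X-Payment-Token")):
|                 self.send_response(402)  # Payment Required
|                 self.end_headers()
|                 self.wfile.write(b"payment required\n")
|                 return
|             self.send_response(200)
|             self.end_headers()
|             self.wfile.write(b"the actual content\n")
|
|     HTTPServer(("127.0.0.1", 8402), Paywalled).serve_forever()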
| makingstuffs wrote:
| As time passes I'm more certain in the belief that the
| internet will end up being a licensed system with insanely
| high barriers to entry which will stop your average dev from
| even being able to afford deploying a hobby project on it.
|
| Your idea of micro-transacting web requests would play into
| it and probably end up with a system like Netflix, where your
| ISP has access to a set of content creators to whom they
| grant 'unlimited' access as part of the service fee.
|
| I'd imagine that accessing any content creators which are not
| part of their package will either be blocked via a paywall
| (buy an addon to access X creators outside our network each
| month) or charged at an insane price per MB as is the case
| with mobile data.
|
| Obviously this is all super hypothetical, but weirder stuff
| has happened in my lifetime.
| AlexandrB wrote:
| A scary observation in light of another front page article
| right now: https://news.ycombinator.com/item?id=44783566
|
| If pages can't be served for free, all internet content is at
| the mercy of payment processors and their ideas of "brand
| safety".
| dspillett wrote:
| "Free" could have a number of meanings here. Free to the
| viewer, free to the hoster, free to the creator, etc...
|
| That content can't be served entirely for free doesn't mean
| that _all_ content will require payment, and so is subject
| to issues with payment processors, just that some things
| may gravitate back to a model where it costs a small amount
| to host something (i.e. pay for home internet and host bits
| off that, or you might have a VPS out there that runs tools
| and costs a few $/yr or /month). I pay for resources to
| host my bits & bobs instead of relying on services provided
| in exchange for stalking the people looking at them, this
| is free for the viewer as they aren't even paying
| indirectly.
|
| Most things are paid for anyway, even if neither the person
| hosting them nor the one looking at them pays directly: adtech
| arseholes give services to people hosting content in
| exchange for the ability to stalk us and attempt to divert
| our attention. Very few sites/apps, other than play/hobby
| ones like mine or those from more actively privacy focused
| types, are free of that.
| Taek wrote:
| That's already a deep problem for all of society. If we
| don't want that to be an ongoing issue, we need to make
| sure money is a neutral infrastructure.
|
| It doesn't just apply to the web, it applies to literally
| everything that we spend money on via a third party
| service. Which is... most everything these days.
| bboygravity wrote:
| I get your thinking, but x.com is proof that simply making
| users pay (quite a lot) does not eliminate bots.
|
| The amount of "verified" paying "users" with a blue checkmark
| that are just total LLM bots is incredible on there.
|
| As long as spamming and DDOS'ing pays more than whatever the
| request costs, it will keep existing.
| Saline9515 wrote:
| Why would I pay for a page if I don't know if the content is
| what I asked for? How much are you going to pay? How much are
| you going to charge? This will end up in SEO hell, especially
| with AI-generated pages farming paid clicks.
| Terretta wrote:
| Or, flip this, don't expect to get paid for pamphleteering?
| adrian_b wrote:
| Your theory does not match the practice of Cloudflare.
|
| Whatever method is used by Cloudflare for detecting "threats"
| has nothing to do with consuming resources on the "protected"
| servers.
|
| The so-called "threats" are identified in users that may make
| a few accesses per day to a site, transferring perhaps a few
| kilobytes of useful data on the viewed pages (besides
| whatever amount of stupid scripts the site designer has
| implemented).
|
| So certainly Cloudflare does not meter the consumed
| resources.
|
| Moreover, Cloudflare preemptively annoys any user who
| accesses for the first time a site, having never consumed any
| resources, perhaps based on irrational profiling based on the
| used browser and operating system, and geographical location.
| zer00eyz wrote:
| > The internet we knew was open and not trusted ...
| monopolistic behavior
|
| Monopolistic is the wrong word, because you have the problem
| backwards. Cloudflare isn't helping Apple/Google... It's
| helping its paying customers, and those are the only services
| those customers want to let through.
|
| Do you know how I can predict that AI agents, the sort that
| end users use to accomplish real tasks, will never take off?
| Because the people your agent would interact with want your
| EYEBALLS for ads, build anti-patterns on purpose, and want to
| make it hard to unsubscribe, cancel, get a refund, or do a
| return.
|
| AI that is useful to people will fail, for the same reason
| that no one has great public APIs anymore: because every
| public company's real customers are its stockholders, and the
| consumers are simply a source of revenue. One that is modeled,
| marketed to, and manipulated, all in the name of returns on
| investment.
| Zak wrote:
| I disagree about AI agents, at least those that work by
| automating a web browser that a human could also use. I
| suppose Google's proposal to add remote attestation to Chrome
| might make it a little harder, but that seems to be dead for
| now (and I hope forever).
| seydor wrote:
| As agents become more useful, the monetization model will
| shift to something... that we haven't thought of yet.
| eddythompson80 wrote:
| > The bots of Big Tech, namely Google, Meta and Apple are of
| course exempt from this by pretty much every website and by
| cloudflare. But try being anyone other than them , no luck.
| Cloudflare is the biggest enabler of this monopolistic behavior
|
| Plenty of site/service owners explicitly want Google, Meta and
| Apple bots (because they believe they have a symbiotic
| relationship with it) and don't want _your_ bot because they
| view you as, most likely, parasitic.
| seydor wrote:
| They didn't seem to mind when OpenAI et al. took all their
| content to train LLMs, back when they were still parasites
| without a symbiotic relationship. This thinking is kind of
| too pro-monopolist for me.
| eddythompson80 wrote:
| Pretty sure they DID mind that. It's what the whole post is
| about.
| golergka wrote:
| That's a good thing. You want an LLM to know about product
| or service you are selling and promote it to its users.
| Getting into the training data is the new SEO.
| TZubiri wrote:
| Websites, and any business really, have the right to impose
| terms of use and deny service.
|
| Anyone circumventing bans is doing something shitty and
| illegal; see the Computer Fraud and Abuse Act and Craigslist
| v. 3Taps.
|
| "And those LLMs didn't ask anyone's permission to crawl the
| entire 'net."
|
| False: OpenAI respects robots.txt, doesn't mask IPs, and paid
| a bunch of $ to Reddit.
|
| You either side with the law or with criminals.
| seydor wrote:
| Is that also how e.g. Anthropic trained on LibGen?
|
| You can't even say the same thing about OpenAI, because we
| don't know the corpus they train their models on.
| binarymax wrote:
| Here's how perplexity works:
|
| 1) It takes your query, and given the complexity might expand
| it to several search queries using an LLM. ("rephrasing")
|
| 2) It runs queries against a web search index (I think it was
| using Bing or Brave at first, but they probably have their own
| by now), and uses an LLM to decide which are the best/most
| relevant documents. It starts writing a summary while it dives
| into sources (see next).
|
| 3) If necessary it will download full source documents that
| popped up in search to seed the context when generating a more
| in-depth summary/answer. They do this themselves because using
| OpenAI to do it is far more expensive.
|
| #3 is the problem. Especially because SEO has really made it so
| the same sites pop up on top for certain classes of queries.
| (for example, Reddit will be on top for product reviews a lot).
| These sites operate on ad revenue so their incentive is to
| block. Perplexity does whatever they can in the game of
| sidestepping the sites' wishes. They are a bad actor.
|
| EDIT: I should also add that Google, Bing, and others, always
| obey robots.txt and they are good netizens. They have enough
| scale and maturity to patiently crawl a site. I wholeheartedly
| agree that if an independent site is also a good netizen, they
| should not be blocked. If Perplexity is not obeying robots.txt
| and they are impatient, they should absolutely be blocked.
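|
| As an orchestration skeleton, the three steps above look
| roughly like this; llm(), web_search() and fetch() are stubs,
| since this is a guess at the general pattern, not Perplexity's
| actual code:
|
|     # rephrase -> search -> fetch & summarize, with stubs.
|     def llm(prompt):
|         return f"[model output for: {prompt[:40]}...]"
|
|     def web_search(query):
|         return [{"url": "https://example.com/review"}]
|
|     def fetch(url):          # step 3: the contested part
|         return "<html>full document text</html>"
|
|     def answer(user_query):
|         # 1) expand the query into one or more search queries
|         queries = [llm("rephrase as search query: " + user_query)]
|         # 2) hit the index, let the model pick relevant hits
|         hits = [h for q in queries for h in web_search(q)]
|         # 3) pull full documents to seed the final summary
|         docs = "\n".join(fetch(h["url"]) for h in hits[:3])
|         return llm(f"answer {user_query!r} using:\n{docs}")
|
|     print(answer("best budget mechanical keyboard"))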
| pests wrote:
| What's wrong with it downloading documents when the user asks
| it to? My browser also downloads whole documents and
| sometimes even prefetches documents I haven't even clicked on
| yet. Toss in an ad blocker or reader mode and my browser also
| strips all the ads.
|
| Why is it okay for me to ask my browser to do this but I
| can't ask my LLM to do the same?
| binarymax wrote:
| There's nothing wrong with downloading documents. I do this
| in my personal search app. But if you are hammering the
| site that wants you to calm down, or bypass robots.txt,
| that's wrong.
| pests wrote:
| robots.txt is for bots and I am not one though. As a user
| I can access anything regardless of it being blocked to
| bots. There are other mechanisms like status codes to
| rate limit or authenticate if that is an issue.
| binarymax wrote:
| I'm talking about perplexity's behavior. Perhaps there's
| a point of contention on perplexity downloading a
| document on a person's behalf. I view this as if there is
| a service running that does it for multiple people, then
| it's a bot.
| layer8 wrote:
| Perplexity makes requests on behalf of its users. I would
| argue that's only illegitimate if the combined volume of
| the requests exceeds what the users would do by an order
| of magnitude or two. Maybe that's what's happening.
|
| But "for multiple people" isn't an argument IMO, since
| each of those people could run a separate service doing
| the same. Using the same service, on the contrary,
| provides an opportunity to reduce the request volume by
| caching.
| michaelt wrote:
| When Google sends people to a review website, 30% of users
| might have an adblocker, but 70% don't. And even those with
| adblockers _might_ click an affiliate link if they found
| the review particularly helpful.
|
| When ChatGPT reads a review website, though? Zero ad
| clicks, zero affiliate links.
| pests wrote:
| So if enough people used adblockers that would make them
| bad too? It's just an issue of numbers?
|
| Brave blocks ads by default. Tools like Pocket and reader
| mode disables ads.
|
| Why is it okay for some user agents but not others?
| raincole wrote:
| I'm sorry, but that's some crazy take.
|
| Sure, the internet should be open and not trusted. But physical
| reality exists. Hosting and bandwidth cost money. I trust
| Google won't DDoS my site or cost me an arbitrary amount of
| money. I won't trust bots made by random people on the internet
| in the same way. The fact that Google respects robots.txt while
| Perplexity doesn't tells you why people trust Google more than
| random bots.
| seydor wrote:
| agree to disagree , but:
|
| Google already has access to any webpage because its own
| search crawlers are allowed by most websites, and Google
| crawls recursively. Thus Gemini has the advantage of this
| synergy with Google Search. Perplexity does not crawl
| recursively (I presume; therefore it does not need to
| consult robots.txt), and it doesn't have synergies with a
| major search engine.
| benregenspan wrote:
| > The bots of Big Tech, namely Google, Meta and Apple are of
| course exempt from this by pretty much every website and by
| cloudflare. But try being anyone other than them , no luck.
| Cloudflare is the biggest enabler of this monopolistic behavior
|
| The Big Tech bots provide proven value to most sites. They have
| also through the years proven themselves to respect robots.txt,
| including crawl speed directives.
|
| If you manage a site with millions of pages, and over the
| course of a couple years you see tens of new crawlers start to
| request at the same volume as Google, and some of them crawl at
| a rate high enough (and without any ramp-up period) to degrade
| services and wake up your on-call engineers, and you can't
| identify a benefit to you from the crawlers, what are you going
| to do? Are you going to pay a lot more to stop scaling down
| your cluster during off-peak traffic, or are you going to start
| blocking bots?
|
| Cloudflare happens to be the largest provider of anti-DDoS and
| bot protection services, but if it wasn't them, it'd be someone
| else. I miss the open web, but I understand why site operators
| don't want to waste bandwidth and compute on high-volume bots
| that do not present a good value proposition to them.
|
| Yes this does make it much harder for non-incumbents, and I
| don't know what to do about that.
| seydor wrote:
| It's because those SEO bots keep crawling over and over,
| which Perplexity does not seem to do (considering that the
| URLs are user-requested). Those are different cases, and
| robots.txt is only about the former. Cloudflare in this case
| is not doing "DDoS protection", because I presume Perplexity
| does not constantly refetch or crawl or DDoS the website (if
| Perplexity does those things then they are guilty).
|
| https://www.robotstxt.org/faq/what.html
|
| I wonder if Cloudflare users explicitly have to allow Google
| or if it's pre-allowed for them when setting up Cloudflare.
|
| Despite what Cloudflare wants us to think here, the web was
| always meant to be an open information network , and spam
| protection should not fundamentally change that
| characteristic.
| benregenspan wrote:
| I believe that AI crawlers are the main thing that is
| currently blocked by default when you enroll a new site. No
| traditional crawlers are blocked, it's not that the big
| incumbents are allow-listed. And I think that clearly
| marked "user request" agents like ChatGPT-User are not
| blocked by default.
|
| But at the end of the day it's up to the site operator, and
| server or reverse proxy provides an easy way to block well-
| behaved bots that use a consistent user-agent.
| akagusu wrote:
| > The Big Tech bots provide proven value to most sites.
|
| They provide value for their companies. If you get some value
| from them it's just a side effect.
| benregenspan wrote:
| It goes without saying that they are profit-oriented. The
| point is that they historically offered a clear trade: let
| us crawl you, and we will refer traffic to you. An AI
| crawler does not provide clear value back. An AI user
| request agent might or might not provide enough clear value
| back for sites to want to participate. (Same goes for the
| search incumbents if they go all-in on LLM search results
| and don't refer much traffic back).
| busymom0 wrote:
| > why does perplexity even need to crawl websites?
|
| I was recently working on a project where I needed to find out
| the published date for a lot of article links and this came
| helpful. Not sure if it's changed recently, but asking ChatGPT,
| Gemini, etc. didn't work; they said they don't have access
| to current websites. However, asking Perplexity, it fetched
| the website in real time and gave me the info I needed.
|
| I do agree with the rest of your comment that this is not a
| random robot crawling. It was doing what a real user (me) asked
| it to fetch.
| andy99 wrote:
| Can't agree more, cloudflare is destroying the internet. We've
| entered the equivalent of when having McAfee antivirus was
| worse than having an actual virus because it slowed down your
| computer too much. These user-hostile solutions have taken us
| back to dialup-era page loading speeds for many sites; it's
| absurd that anyone thinks this is a service worth paying for.
| rstat1 wrote:
| So server owners are just supposed to bend over and take all
| the abuse they get from shitty bots and DDOS attacks and do
| nothing?
|
| That seems pretty unreasonable.
| spwa4 wrote:
| No they're supposed to allow scraping and information
| aggregation. That's the essence of the web: it's all text,
| crawlable, machine-readable (sort of) and parseable. Feel
| free to block ddos'es.
| bayindirh wrote:
| Feel free to crawl paywalled sites and republish them
| with discoverable links.
|
| Also after starting the crawl, you can read about Aaron
| Swartz while waiting.
| inetknght wrote:
| No, they're supposed to rally together and fight for better
| laws and enforcement of those laws. Which is, arguably,
| exactly what they've done just in a way that you and I
| don't like.
| armchairhacker wrote:
| What kind of laws and enforcement would stop a foreign
| actor from effectively DDoSing your site? What if the
| actor has (illegally) hacked tech-illiterate users so
| they have domestic residential IP addresses?
| inetknght wrote:
| > _What kind of laws and enforcement would stop a foreign
| actor from effectively DDoSing your site?_
|
| The kind of laws and enforcement that would block that
| entire country from the internet if it doesn't get its
| criminal act together.
| madrox wrote:
| There is a difference between blocking abusive behavior and
| blocking all bots. No one really cared about bot scraping
| to this degree before AI scraping for training purposes
| became a concern. This is fearmongering by Cloudflare for
| website maintainers who haven't figured out how to adapt to
| the AI era so they'll buy more Cloudflare.
| remus wrote:
| > No one really cared about bot scraping to this degree
| before AI scraping for training purposes became a
| concern. This is fearmongering by Cloudflare for website
| maintainers who haven't figured out how to adapt to the
| AI era so they'll buy more Cloudflare.
|
| I think this is an overly harsh take. I run a fairly
| niche website which collates some info which isn't
| available anywhere else on the internet. As it happens I
| don't mind companies scraping the content, but I could
| totally understand if someone didn't want a company
| profiting from their work in that way. No one is under an
| obligation to provide a free service to AI companies.
| adrian_b wrote:
| Unreasonable is to use such incompetent companies like
| Cloudflare, which are absolutely incapable of
| distinguishing between the normal usage of a Web site by
| humans and DDOS attacks or accesses done by bots.
|
| Only this week I have witnessed several dozen cases when
| Cloudflare has blocked normal Web page accesses without any
| possible correct reason, and this besides the normal
| annoyance of slowing every single access to any page on
| their "protected" sites with a bot check popup window.
| rstat1 wrote:
| I don't know seems like it was working as intended to me.
| CharlesW wrote:
| Ethics-free organizations and individuals like Perplexity are
| _why Cloudflare exists_. If you have a better way to solve
| the problems that they solve, the marketplace would reward
| you handsomely.
| Terretta wrote:
| Do you think users shouldn't get to have user agents or
| that "content farm ads scaffold" as a business model has a
| right to be viable? Forcing users to reward either stance
| seems unsustainable.
| CharlesW wrote:
| > _Do you think users shouldn't get to have user agents
| or that "content farm ads scaffold" as a business model
| has a right to be viable?_
|
| Users should get to have authenticated, anonymous proxy
| user agents. Because companies like Perplexity just
| ignore `robots.txt`, maybe something like Private Access
| Tokens (PATs) with a new class for autonomous agents
| could be a solution for this.
|
| By "content farm ads scaffold", I'm not sure if you had
| Perplexity and their ads business in mind, or those
| crappy little single-serving garbage sites. In any case,
| they shouldn't be treated differently. I have no problem
| with the business model, other than that the scam only
| works because it's currently trivial to parasitically
| strip-mine and monetize other people's IP.
| adrian_b wrote:
| While the existence of Perplexity may justify the existence
| of Cloudflare, it does not justify the incompetence of
| Cloudflare, which is unable to distinguish accesses done by
| Perplexity and the like from normal accesses done by
| humans, who use those sites exactly for the purpose they
| exist, so there cannot be any excuse for the failure of
| Cloudflare to recognize this.
| bob1029 wrote:
| > when having McAffe antivirus was worse than having an
| actual virus because it slowed down your computer to much
|
| This exact same thing continues in 2025 with Windows
| Defender. The cheaper Windows Server VMs in the various cloud
| providers are practically unusable until you disable it.
|
| You can tell this stuff is no longer about protecting users
| or property when there are no meaningful workarounds or
| exceptions offered anymore. You _must_ use defender (or
| Cloudflare) unless you intend to be a naughty pirate user.
|
| I think half of this stuff is simply an elaborate power trip.
| Human egos are fairly predictable machines in aggregate.
| adrian_b wrote:
| In the previous years, I did not have many problems with
| Cloudflare.
|
| However, in the last few months, Cloudflare has become
| increasingly annoying. I suspect that they might have
| implemented some "AI" "threat" detection, which gives much
| more false positives than before.
|
| For instance, this week I have frequently been blocked when
| trying to access the home page of some sites where I am a
| paid subscriber, with a completely cryptic message "The
| action you just performed triggered the security solution.
| There are several actions that could trigger this block
| including submitting a certain word or phrase, a SQL command
| or malformed data.".
|
| The only "action" that I have done was opening the home page
| of the site, where I would then normally login with my
| credentials.
|
| Also, during the last few days I have been blocked from
| accessing ResearchGate. I may happen to hit a few times per
| day some page on the ResearchGate site, while searching for
| various research papers, which is the very purpose of that
| site. Therefore I cannot understand what stupid algorithm is
| used by Cloudflare, that it declares that such normal usage
| is a "threat".
|
| The weird part is that this blocking happens only if I use
| Firefox (Linux version). With another browser, i.e. Vivaldi
| or Chrome, I am not blocked.
|
| I have no idea whether Cloudflare specifically associates
| Firefox on Linux with "threats" or this happens because
| whatever flawed statistics Cloudflare has collected about my
| accesses have all recorded the use of Firefox.
|
| In any case, Cloudflare is completely incapable of
| discriminating between normal usage of a site by a human
| (which may be a paying customer) and "threats" caused by bots
| or whatever "threatening" entities might exist according to
| Cloudflare.
|
| I am really annoyed by the incompetent programmers who
| implement such dumb "threat detection solutions". They create
| major inconveniences for countless people around the world,
| while hiding behind their employer corporation and never
| suffering consequences proportional to the problems they
| cause to others.
| concinds wrote:
| > The internet we knew was open and not trusted, but thanks
| to companies like Cloudflare, now even the most benign, well-
| meaning attempt to GET a website is met with a brick wall
|
| I don't think it's fair to blame Cloudflare for that. That's
| looking at a pool of blood and not what caused it: the
| bots/traffic which predate LLMs. And Cloudflare is working to
| fix it with the PrivacyPass standard (which Apple joined).
|
| Each website is freely opting into it. No one was forced. Why
| not ask yourself why that is?
| seydor wrote:
| do you think that every well-meaning GET request should be
| treated the same way as a distributed attack? The latter is
| the reason why people use CF, not the former.
| concinds wrote:
| The line can be _extremely blurry_ (that's putting it
| mildly), and "the latter" is not the only reason people use
| CF (actually, I wouldn't be surprised at all if it wasn't
| even the biggest reason).
| akagusu wrote:
| The reason people use Cloudflare is that they provide a
| free CDN, and we have at least 10 years of content
| marketing out there telling aspiring bloggers that, if
| they use a CDN in front of their website, their shitty
| WordPress website hosted on shady shared hosting will
| become fast.
| tonyhart7 wrote:
| well they aren't wrong
| renrutal wrote:
| How does one tell a "well-meaning" request from an attack?
| sellmesoap wrote:
| By the volume, distribution, and parameters (GET parameters
| and POST body) of the requests.
| pkilgore wrote:
| > This is funny coming from Cloudflare, the company that blocks
| most of the internet from being fetched with antispam checks
| even for a single web request.
|
| Am I misunderstanding something? I (the site owner) pay
| Cloudflare to do this. It is my fault this happens, not
| Cloudflare's.
| layer8 wrote:
| You're paying Cloudflare to not get DDoS-attacked or swamped
| by illegitimate requests. GP is implying that Cloudflare
| could do a better job of not blocking legitimate, benign
| requests.
| pkilgore wrote:
| Then we're all operating with very different definitions of
| legitimate or benign!
|
| I've only ever seen a Cloudflare interstitial when viewing
| a page with my VPN on, for example -- something I'm happy
| about as a site owner and accept quite willingly as a VPN
| user knowing the kinds of abuse that occur over VPN.
| kentonv wrote:
| > the "perplexity bots" aren't crawling websites, they fetch
| URLs that the users explicitly asked for. This shouldn't count
| as something that needs robots.txt access. It's not a robot
| randomly crawling, it's the user asking for a specific page
| and basically a shortcut for copy/pasting the content
|
| You say "shouldn't" here, but why?
|
| There seems to be a fundamental conflict between two groups who
| each assert they have "rights":
|
| * Content consumers claim the right to use whatever software
| they want to consume content.
|
| * Content creators claim the right to control how their content
| is consumed (usually so that they can monetize it).
|
| These two "rights" are in direct conflict.
|
| The bias here on HN, at least in this thread, is clearly
| towards the first "right". And I tend to come down on this side
| myself, as a computer power user. I hate that I cannot, for
| example, customize the software I use to stream movies from
| popular streaming services.
|
| But on the other hand, content costs money to make. Creators
| need to eat. If the content creators cannot monetize their
| content, then a lot of that content will stop being made. Then
| what? That doesn't seem good for anyone, right?
|
| Whether or not you think they have the "right", Perplexity
| totally breaks web content monetization. What should we do
| about that?
|
| (Disclosure: I work for Cloudflare but not on anything related
| to this. I am speaking for myself, not Cloudflare.)
| kiratp wrote:
| The web browsers that the AI companies are about to ship will
| make requests that are indistinguishable from user requests.
| The ship has sailed on trying to save monetization.
| kentonv wrote:
| We will be able to distinguish them.
| Terretta wrote:
| "Creators" need to eat, OK, but there's no right to get paid
| to paste yesterday's recycled newspapers on my laptop screen.
| Making that unprofitable seems, by and large, incredibly good
| for everyone.
|
| It'd likely be a _fantastic_ good if "content creators"
| stopped being able to eat from the slop they shovel. In the
| meantime, the smarter the tools that let folks never
| encounter that form of "content", the more they will pay for
| them.
|
| There remain legitimate information creation or information
| discovery activities that nobody used to call "content". One
| can tell which they are by whether they have names that pre-
| date SEO, like "research" or "journalism" or "creative
| writing".
|
| Ad-scaffolding, what the word "content" came to mean, costs
| money to make, ideally less than the ads it hosts generate.
| This simple equation means the whole ecosystem,
| together with the technology attempting to perpetuate it as
| viable, is an ouroboros, eating its own effluvia.
|
| It is, I would argue, undetermined that advertising-driven
| content as a business model has a "right" to exist in today's
| form, rather than any number of other business models that
| sufficed for millennia of information and artistry before.
|
| Today LLMs serve both the generation of additional literally
| brain-less content, and the sifting of such from information
| worth using. Both sides are up in arms, but in the long run,
| it sure seems some other form of information origination and
| creativity is likely to serve everyone better.
| mastodon_acc wrote:
| As a website owner I definitely want the capability to allow
| and block certain crawlers. If I say I don't want crawlers
| from Perplexity, they should respect that. This sneaky
| evasion just highlights that the company is not to be
| trusted, and I would definitely pay any hosting provider that
| helps me enforce blocking parasitic companies like
| Perplexity.
| cantaccesrssbit wrote:
| I crawl 3000 RSS feeds once a week. Let me tell you!
| Cloudflare sucks. What business is it of theirs to block
| something that is meant to be accessed by everyone, like an
| RSS feed? FU Cloudflare.
| KomoD wrote:
| That's not Cloudflare's fault, that's the website owner's
| fault.
|
| If they want the RSS feeds to be accessible then they should
| configure it to allow those requests.
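|
| For example (a rough sketch, not an official Cloudflare
| recipe; exact fields and available actions depend on plan and
| product): a WAF custom rule that matches feed paths and uses
| the "Skip" action for the relevant bot-protection features,
| with an expression along the lines of
|
| ```
| (http.request.uri.path contains "/feed") or
| (http.request.uri.path contains "/rss")
| ```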
| blantonl wrote:
| Ask yourself why so many content hosting platforms utilize
| Cloudflare's services and then contrast that perspective with
| your posted one. Might enlighten you a bit to think about that
| for a second.
| bwb wrote:
| I could not keep my website up without Cloudflare given the
| level of bot and AI crawlers hammering things. I try whenever
| possible to use challenges, but sometimes I have to block
| entire AS blocks.
| golergka wrote:
| Ironically, Cloudflare is also the reason OpenAI agent mode
| with web use isn't very usable right now. Every second time I
| asked it to do a mundane task like checking me in for a
| flight, it couldn't because of Cloudflare.
| tonyhart7 wrote:
| What's ironic about this?
|
| We're seeing many posts about site owners getting hit by
| millions of requests because of LLMs; we can't blame
| Cloudflare for this because it's literally a necessary evil.
| daft_pink wrote:
| I'm just curious at what point AI is a crawler and at what
| point AI is a client, when the user is directing the searches
| and the AI is executing them.
|
| Perplexity Comet sort of blurs the lines there, as does typing
| questions into Claude.
| mikewarot wrote:
| So, this calls for a new type of honeytrap, content that appears
| to be human generated, and high quality, but subtly wrong,
| preferably in a commercially catastrophic way. Behind settings
| that prohibit commercial usage.
|
| It really shouldn't be hard to generate gigantic quantities of
| the stuff. Simulate old forum posts, or academic papers.
| jgrall wrote:
| This made me laugh. A form of malicious compliance.
| ascorbic wrote:
| They did that too https://blog.cloudflare.com/ai-labyrinth/
| djoldman wrote:
| The cat's out of the bag / Pandora's box is open with respect
| to AI training data.
|
| No amount of robots.txt or walled-gardening is going to be
| sufficient to impede generative AI improvement: Common Crawl and
| other data dumps are sufficiently large, not to mention easier to
| acquire and process, that the backlash against AI companies
| crawling folks' web pages is meaningless.
|
| Cloudflare and other companies are leveraging outrage to acquire
| more users, which is fine... users want to feel like AI companies
| aren't going to get their data.
|
| The faster that AI companies are excluded from categories of
| data, the faster they will shift to categories from which they're
| not excluded.
| tempfile wrote:
| A lot of people posting here seem to think you have a magical
| god-given right to make money from posting on the public
| internet. You do not. Effective rate-limiting of crawlers is
| important, but if the rate _is_ moderated, you do not have a
| right to decide what people do with the content. If you don't
| believe that, get off the internet, and don't let the door hit
| you on the way out.
| ibero wrote:
| what if i want the rate set to zero?
| tempfile wrote:
| Then turn off the server?
|
| You don't have a right to say who or what can read your
| public website (this is a normative statement). You do have a
| right not to be DoS'd. If you pretend not to know what that
| means, it sounds the same as saying "you have an arbitrary
| right to decide who gets to make requests to your service",
| but it does not mean that.
| jgrall wrote:
| "You do not have a right to decide what people do with the
| content." Smh. Yes, laws be damned.
| pera wrote:
| Like many other generative AI companies, Perplexity exploits the
| good faith of the old Internet by extracting the content created
| almost entirely by normal folks (i.e. those who depend on a wage
| for subsistence) and reproducing it for a profit while removing
| the creators from the loop - even when normal folks are
| explicitly asking them to not do this.
|
| If you don't understand why this is at least slightly
| controversial I imagine you are not a normal folk.
| rapatel0 wrote:
| This is brilliant marketing and strategy from Cloudflare. They
| are pointing out bad actors and selling a service where they can
| be the private security guards for your website.
|
| I think there could be something interesting if they made a
| caching pub-sub model for data scraping, in addition to or in
| place of trying to be security guards.
| czk wrote:
| the year is 2045.
|
| you've been cruising the interstate in your robotaxi, shelling
| out $150 in stablecoins at the cloudflare tollbooth. a palantir
| patrol unit pulls you over. the optimus v4 approaches your window
| and contorts its silicone face into a facsimile of concern as it
| hits you with the:
|
| "sir, have you been botting today?"
|
| immediately you remember how great you had it in the '20s when
| you used to click CAPTCHA grids to prove your humanity to dumb
| algorithms, but now the machines demand you recite poetry or weep
| on command
|
| "how much have you had to bot today?", its voice taking on an
| empathetic tone that was personalized for your particular profile
|
| "yeah... im gonna need you to exit the vehicle and take a field
| humanity test"
| benreesman wrote:
| greentext courtesy of opus prompted in greentext by me.
|
| ``` The user wants me to create a more elaborate greentext with
| deeper worldbuilding and references to internet culture/incel
| memes, but specifically asks for "tasteful" ones and explicitly
| says no violent references. I should incorporate classic
| internet/chan culture references like tendies, NEET lifestyle,
| "tfw no gf", basement dwelling, etc. while building out this
| dystopian 2045 world where proving humanity is the central
| challenge.
|
| \>be me, 2045 \>32yo NEET still living in mom's smart-basement
| \>she threatens to turn off my Neuralink if I don't go outside
| today \>grudgingly take the robotaxi to pick up tendies \>$150
| stablecoin toll at cloudflare checkpoint \>have to solve a
| moral dilemma to prove I'm not running on GPT-7 \>sweating.dll
| \>palantir unit pulls me over \>optimus v4 leans into window
| \>its facial mesh attempts "concern_expression_v2.blend"
| \>"sir, when did you last feel genuine human connection?"
| \>flashback to 2024 when the girl at McDonalds gave me extra
| honey mustard \>that was before the McBots took over
| \>"t-twenty one years ago officer" \>optimus's empathy
| subroutines activate \>"sir I need you to perform a field
| humanity test" \>get out, knees weak from vitamin D deficiency
| \>"please describe your ideal romantic partner without using
| the words 'tradwife' or 'submissive'" \>brain.exe has stopped
| responding \>try to remember pre-blackpill emotions \>"someone
| who... likes anime?" \>optimus scans my biometrics \>"stress
| patterns indicate authentic social anxiety, carry on citizen"
| \>get back in robotaxi \>it starts therapy session \>"I notice
| you ordered tendies again. Let's explore your relationship with
| your mother" \>tfw the car has better emotional intelligence
| than me \>finally get tendies from Wendy's AutoServ \>receipt
| prints with mandatory "rate your humanity score today" \>3.2/10
| \>at least I'm improving
|
| \>mfw bots are better at being human than humans \>it's over
| for carboncels ```
| decide1000 wrote:
| C'mon CF. What are you doing? You are literally breaking the
| internet with your police behaviour. It's starting to look
| like the Great Firewall.
| jgrall wrote:
| Not affiliated with CF in any way. Respectfully disagree.
| Calling out bad actors is in the public interest.
| imcritic wrote:
| CF is a bad actor. They ruin the internet. They own more and
| more parts of it.
| kylestanfield wrote:
| Perplexity claims that you can "use the following robots.txt tags
| to manage how their sites and content interact with Perplexity."
| https://docs.perplexity.ai/guides/bots
|
| Their fetcher (not crawler) has the user agent Perplexity-User.
| Since the fetching is user-requested, it ignores robots.txt.
| The article discusses how blocking the "Perplexity-User" user
| agent doesn't actually work, and how Perplexity uses a generic,
| undeclared user agent to avoid being blocked.
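|
| For reference, the linked docs list (at the time of writing)
| two agents: PerplexityBot, the declared crawler, and
| Perplexity-User, the user-triggered fetcher. A minimal
| robots.txt opting out of both, assuming those docs are
| current, would be:
|
| ```
| User-agent: PerplexityBot
| Disallow: /
|
| User-agent: Perplexity-User
| Disallow: /
| ```
|
| Which is exactly why the article matters: directives like
| these only bind clients that identify themselves; a generic
| browser user agent coming from undisclosed IP ranges ignores
| them by construction.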
| nostrademons wrote:
| It's entirely possible that it's not _Perplexity_ using the
| stealth undeclared crawlers, but rather their fallback is to
| contract out to a dedicated for-pay webscraping firm that
| retrieves the desired content through unspecified means. (Some of
| these are pretty dodgy - several scraping companies effectively
| just install malware on consumer machines and then use their
| botnet to grab data for their customers.) There was a story on
| HN not long ago about the FBI using similar means to perform
| surveillance that would be illegal if the FBI did it itself, but
| becomes legal once they split the different parts up across a
| supply chain:
|
| https://news.ycombinator.com/item?id=44220860
| echo42null wrote:
| Hmm, I've always seen robots.txt more as a polite request than an
| actual rule.
|
| Sure, Google has to follow it because they're a big company and
| need to respect certain laws or internal policies. But for
| everyone else, it's basically just a "please don't" sign, not a
| legal requirement or?
| crossroadsguy wrote:
| I was recently listening to Cloudflare CEO on the Hard Fork
| podcast. He seemed to be selling a way for content creators to
| stop AI companies from profiting off such leeching. But the way
| he laid the whole thing out, adding how they are best placed to
| do this because they are gatekeepers of X% of the Internet (I
| don't recall the exact percentage), had me more concerned than I
| was at the prospect of AI companies being the front for
| summarised or interpreted consumption.
|
| He went on, upfront (I'll give him that), to explain how he
| expects a certain percentage of the income that will come
| from enforcing this on those AI companies once they pay up
| to crawl.
|
| Cloudflare already questions my humanity and then every once in a
| while blocks me with zero recourse. Now they are literally
| proposing more control and gatekeeping.
|
| Where have we ended up on the Internet? Are we openly going
| back to the wild west of bounty hunters and Pinkertons (in a
| way)?
| skeledrew wrote:
| This is why Perplexity is my preferred deep search engine. The
| no-crawl directives don't really make sense when I'm doing
| research and want my tool of choice to be able to pull from any
| relevant source. If a site doesn't want particular users to
| access its content, put it behind a login. The only way I - and
| eventually many others - will see it in the first place anyway is
| when it pops up as a cited source in the LLM output, and there's
| an actual need to go to said source.
| remus wrote:
| > The no-crawl directives don't really make sense when I'm
| doing research and want my tool of choice to be able to pull
| from any relevant source.
|
| If you are the source I think they could make plenty of sense.
| As an example, I run a website where I've spent a lot of time
| documenting the history of a somewhat niche activity. Much of
| this information isn't available online anywhere else.
|
| As it happens I'm happy to let bots crawl the site, but I think
| it's a reasonable stance to not want other companies to profit
| from my hard work. Even more so when it actually costs me money
| to serve requests to the company!
| crazygringo wrote:
| > _but I think it's a reasonable stance to not want other
| companies to profit from my hard work_
|
| Imagine someone at another company reads your site, and it
| informs a strategic decision they make at the company to make
| money around the niche activity you're talking about. And
| they make lots of money they wouldn't have otherwise. That's
| totally legal and totally ethical as well.
|
| The reality is, if you do hard work and make the results
| public, well you've made them public. People and corporations
| are free to profit off the facts you've made public, and they
| should be. There are certain limited copyright protections
| (they can't sell large swathes of your words verbatim), but
| that's all.
|
| So the idea that you don't want companies to profit from your
| hard work _is_ unreasonable, if you make it public. If you
| don't want that to happen, don't make anything public.
| remus wrote:
| For me, the point is that the person who has put in the
| work then has some rights to decide how that information is
| accessed and re-used. I think it is a reasonable position
| for someone to hold that they want individuals to be able
| to freely use some content they produced, but not for a
| company to use and profit from that same content. I think
| just saying "It's public now" lacks any nuance.
|
| Ultimately these AI tools are useful because they have
| access to huge swaths of content, and the owners of these
| tools turn a lot of revenue by selling access to these
| tools. Ultimately I think the internet will end up a much
| worse place if companies don't respect clearly established
| wishes of people creating the content, because if companies
| stop respecting things like robots.txt then people will
| just hide stuff behind logins, paywalls and frustrating
| tools like Cloudflare, which use heuristics to block
| malicious traffic.
| crazygringo wrote:
| > _the person who has put in the work then has some
| rights to decide how that information is accessed and re-
| used_
|
| You do, but you give up those rights when you make the
| work public.
|
| You think an author has any control over who their book
| gets lent to once somebody buys a copy? You think they
| get a share of profits when a CEO reads their book and
| they make a better decision? Of course not.
|
| What you're asking for is unreasonable. It's not
| workable. Knowledge can't be owned. Once you put it out
| there, it's out there. We have copyright and patent
| protections in specific circumstances, but that's all.
| You don't own facts, no matter how much hard work and
| research they took to figure out.
| ch_fr wrote:
| On a more human level, I think it's bleak that someone who
| makes a blog just to share stuff for fun is going to have
| most of his traffic be scrapers that distill, distort, and
| reheat whatever he's writing before serving it to potential
| readers.
| alexey-salmin wrote:
| > As it happens I'm happy to let bots crawl the site, but I
| think it's a reasonable stance to not want other companies to
| profit from my hard work.
|
| How do you square these two? Of course big companies profit
| from your work, this is why they send all these bots to crawl
| your site.
| remus wrote:
| When I said "I think it's a reasonable stance" I meant as
| in "I think it's a reasonable stance for someone to take,
| though I don't personally hold that view".
| dhanushreddy29 wrote:
| PS: Perplexity is using Cloudflare Browser Rendering to scrape
| websites.
| kazinator wrote:
| Why single out Perplexity? Pretty much no crawler out there
| fetches robots.txt.
|
| robots.txt is not a blocking mechanism; it's a hint to indicate
| which parts of a site might be of interest to indexing.
|
| People started using robots.txt to lie and declare things like no
| part of their site is interesting, and so of course that gets
| ignored.
| gcbirzan wrote:
| That's not true, at all.
| kotaKat wrote:
| An AI service violating peoples' consent? Say it isn't so! Those
| damn assault-culture techbros at it again.
| bob1029 wrote:
| Has anyone bothered to properly quantify the worst-case load
| (i.e., requests per second) incurred by these scraping
| tools? I recall a post on HN a few weeks/months ago
| about something similar, but it seemed very light on figures.
|
| It seems to me that ~50% of the discourse occurring around AI
| providers involves the idea that a machine reading webpages on a
| regular schedule is tantamount to a DDoS attack. The other
| half seems to be about IP and capitalism concerns - which
| seem like far more viable arguments.
|
| If someone requesting your site map once per day is crippling
| operations, the simplest solution is to make the service not run
| like shit. There is a point where your web server becomes so fast
| you stop caring about locking everyone into a draconian content
| prison. If you can serve an average page in 200 µs and your
| competition takes 200 ms to do it, you have roughly 1000x the
| capacity to mitigate an aggressive scraper (or actual DDoS
| attack) in terms of CPU time.
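|
| To put numbers on that, a toy back-of-envelope with the same
| illustrative figures, per CPU core:
|
| ```
| fast = 200e-6   # 200 microseconds per page
| slow = 200e-3   # 200 milliseconds per page
|
| # Requests per second a single core can absorb, ignoring
| # bandwidth, memory, and I/O:
| print(1 / fast)                 # 5000 req/s
| print(1 / slow)                 # 5 req/s
| print((1 / fast) / (1 / slow))  # 1000x
| ```
|
| That obviously ignores the fact that scrapers tend to hammer
| the most expensive endpoints rather than the average page.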
| ch_fr wrote:
| I mean, it did happen, don't you remember in March when
| SourceHut had outages because their most expensive endpoints
| were being spammed by scrapers?
|
| Don't you remember the reason Anubis even came to be?
|
| It really wasn't that long ago, so I find all of the snarky
| comments going "erm, actually, I've yet to see any good actors
| get harmed by scraping ever, we're just reclaiming power from
| today's modern ad-ridden hellscape" pretty dishonest.
| xmodem wrote:
| Question for those in this thread who are okay with this: If I
| have endpoints that are computationally expensive server-side,
| what mechanism do you propose I could use to avoid being
| overwhelmed?
|
| The web will be a much worse place if such services are all
| forced behind captchas or logins.
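|
| Rate limiting is the obvious first answer, e.g. a token
| bucket keyed on client IP, but that is exactly what large
| scraper address pools defeat. A minimal sketch of what I
| mean (toy code, not production):
|
| ```
| import time
| from collections import defaultdict
|
| class TokenBucket:
|     """Allow `rate` requests/sec with bursts up to `capacity`."""
|
|     def __init__(self, rate, capacity):
|         self.rate = rate
|         self.capacity = capacity
|         self.tokens = defaultdict(lambda: capacity)
|         self.last = defaultdict(time.monotonic)
|
|     def allow(self, key):
|         now = time.monotonic()
|         elapsed = now - self.last[key]
|         self.last[key] = now
|         # Refill in proportion to elapsed time, capped at capacity.
|         self.tokens[key] = min(
|             self.capacity,
|             self.tokens[key] + elapsed * self.rate)
|         if self.tokens[key] >= 1:
|             self.tokens[key] -= 1
|             return True
|         return False
|
| limiter = TokenBucket(rate=1.0, capacity=5)  # 1 req/s, burst 5
| # In a handler: if not limiter.allow(client_ip): return 429
| ```
|
| That works against one abusive client; it does nothing
| against ten thousand residential IPs making one request each.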
| m3047 wrote:
| In 2005 I used a bot motel with Markov-chain-derived dummy
| content for this exact purpose.
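|
| The technique is easy to reproduce: train a word-level Markov
| chain on any throwaway corpus and random-walk it to emit
| endless plausible-looking junk to clients you have classified
| as bots. A minimal sketch (not the 2005 code, obviously):
|
| ```
| import random
| from collections import defaultdict
|
| def build_chain(text):
|     """Map each word to the words observed to follow it."""
|     words = text.split()
|     chain = defaultdict(list)
|     for a, b in zip(words, words[1:]):
|         chain[a].append(b)
|     return chain
|
| def babble(chain, length=60):
|     """Random-walk the chain to generate dummy 'content'."""
|     word = random.choice(list(chain))
|     out = [word]
|     for _ in range(length - 1):
|         word = random.choice(chain.get(word) or list(chain))
|         out.append(word)
|     return " ".join(out)
|
| corpus = open("seed.txt").read()  # any throwaway text dump
| print(babble(build_chain(corpus)))
| ```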
| alexey-salmin wrote:
| How do you make the money you need to finance these
| computationally expensive server-side endpoints?
| xmodem wrote:
| Maybe I'm a community-driven project funded by donations and
| volunteer time. Maybe I'm a local government with extremely
| limited IT budget and no in-house skills. Maybe I'm just some
| dude who maintains a hobby project that lives on a NUC under
| my desk.
| codecracker3001 wrote:
| > we were able to fingerprint this crawler using a combination of
| machine learning and network signals.
|
| what machine learning algorithms are they using? time to deploy
| them onto our websites
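|
| The post doesn't say, so this is pure guesswork at the shape
| of the problem: derive per-session features (request rate,
| whether the TLS/HTTP fingerprint matches the claimed user
| agent, datacenter vs. residential ASN, and so on) and feed
| them to an ordinary classifier trained on labeled traffic. A
| toy sketch with made-up features:
|
| ```
| from sklearn.ensemble import RandomForestClassifier
|
| # Hypothetical per-session features:
| # [req_per_min, ua_matches_tls_fp (0/1),
| #  distinct_path_ratio, datacenter_asn (0/1)]
| X = [
|     [300, 0, 0.98, 1],  # fast, mismatched fingerprint: bot
|     [2,   1, 0.20, 0],  # slow, consistent browser: human
|     [150, 0, 0.90, 1],
|     [5,   1, 0.35, 0],
| ]
| y = [1, 0, 1, 0]        # 1 = bot, 0 = human
|
| clf = RandomForestClassifier(n_estimators=50, random_state=0)
| clf.fit(X, y)
| print(clf.predict([[200, 0, 0.95, 1]]))  # likely [1]
| ```
|
| The model is the easy part; the labeling and the network
| signals (JA3/JA4-style TLS fingerprints, IP reputation) are
| where the real work is.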
| madrox wrote:
| Every time there's an industry disruption there's good money to
| be made in providing services to incumbents that slow the
| transition down. You saw it in streaming, and even the internet
| at large. Cloudflare just happens to be the business filling that
| role this time.
|
| I don't really mind because history shows this is a temporary
| thing, but I hope website maintainers have a plan B beyond
| hoping Cloudflare will protect them from AI forever. Whoever
| builds an on-ramp for people who run websites today to make
| money from AI will make a lot of money.
| nialse wrote:
| Is it just me or is it rage bait? Switching up marketing a
| notch since the AI paywall did not get much media attention
| so far?
| Cloudflare seems to focus on enterprise marketing nowadays,
| currently geared towards the media industry, rather than the
| technical marketing suited for the HN audience. They have no
| horse in the AI race, so they're betting on the anti-AI horse
| instead to gain market share in the media sector?
| zeld4 wrote:
| The Internet was built on trust, but not anymore. It's a
| Darwinian system; everyone has to find their own way to
| survive.
|
| Cloudflare will help their publishers block more aggressively,
| and AI companies will up their game too. Harvesting
| information online is hard labor that needs to be paid for,
| either to AI or to humans.
| hnburnsy wrote:
| Response from Perplexity to TechCrunch...
|
| >Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's blog
| post as a "sales pitch," adding in an email to TechCrunch that
| the screenshots in the post "show that no content was accessed."
| In a follow-up email, Dwyer claimed the bot named in the
| Cloudflare blog "isn't even ours."
| throwmeaway222 wrote:
| Change "no-crawl" to "will-sue"
|
| and see if that fixes the problem.
| fsckboy wrote:
| the internet needs micropayments (probably millipayments). if
| crawlers want to pay me a penny a page, crawl me 24-7 plz
|
| if I am willing to pay a penny a page, i and the people like me
| won't have to put up with clickwrap nonsense
|
| free access doesn't have to be shut off (ok, it will be, but it
| doesn't have to be, and doesn't that tell you something?)
|
| reddit could charge stiffer fees, but refund quality content to
| encourage better content. i've fantasized about ideas like "you
| pay a deposit upfront; you get banned, you lose your deposit;
| withdraw, get your deposit back", the goal being to simplify
| the moderation task while encouraging quality.
|
| because where the internet is headed is just more and more trash.
|
| here's another idea, pay a penny per search at google/search
| engine of choice. if you don't like the results, you can take the
| penny back. google's ai can figure out how to please you. if the
| pennies don't keep coming in, they serve you ad-infested results;
| if they serve up ad-infested results, you can send your penny
| to a
| different search engine.
| zzo38computer wrote:
| I do not want to block curl and lynx. But if they claim to be
| Chrome then I don't care if Chrome is blocked
| qwerty456127 wrote:
| It's time to stop blocking crawlers and using captchas and start
| building web sites that are intentionally AI-friendly by design.
| Even before the modern LLMs, anti-scraper measures apparently
| primarily benefited Google, whose scrapers were the most
| common exception.
| hrpnk wrote:
| Previously it was all sniper and sneaker bots scanning websites
| for product availability and attempting purchases continuously
| to snipe items when they come back online.
|
| Now, it's a gazillion AI crawlers, Python crawlers, and MCP
| servers that offer the same feature to anyone "building
| (personal workflow) automation", incl. bypassing various
| standard protection mechanisms.
___________________________________________________________________
(page generated 2025-08-04 23:00 UTC)