Subj : Cloudflare's pay-per-crawl is built to fail. Here's why
To   : All
From : TechnologyDaily
Date : Fri Jul 18 2025 11:45:07

Cloudflare's pay-per-crawl is built to fail. Here's why

Date: Fri, 18 Jul 2025 10:28:40 +0000
Description: When Cloudflare announced its new Pay-Per-Crawl marketplace, some people saw a breakthrough.

FULL STORY
======================================================================

When Cloudflare announced its new Pay-Per-Crawl marketplace, some people saw a breakthrough. The idea is that if AI companies want to crawl your website to train their models, they should compensate you for the use of your content. As the CEO of a legal AI company recently sued for scraping public data, I'd love for this to work. However, it won't, at least not in this way.

Last year, my company, Caseway, was sued by CanLII, the operator of Canada's free legal decisions database, for allegedly using publicly available court data without a license. I've had a front-row seat to the vagueness of the legal rules surrounding AI scraping. And I've watched the wave of litigation since.

The New York Times sued OpenAI and Microsoft for using millions of their paywalled articles to train GPT-4. News Corp went after Perplexity for scraping Wall Street Journal content to generate answer pages. GitHub Copilot faces class actions from developers whose open-source code was ingested without attribution. Even Reddit sued Anthropic for allegedly training Claude on its forums without consent. Scraping is how the AI industry was built, at least for many AI companies.

At first glance, Cloudflare's new system appears to be a step forward. The company sits in front of 20% of the internet, so if anyone can enforce crawl permissions at scale, it's them. Cloudflare states that websites can now block AI crawlers by default and require them to pay for each page request. Instead of an arms race over bot blockers and sneaky scrapers, maybe there's a chance to align incentives.
However, this marketplace makes two significant mistakes and overlooks one even more substantial issue.

Not All Pages Are Equal

The first issue is pricing. Right now, Pay-Per-Crawl treats every page as a billable unit. But come on, a Pulitzer-winning investigation that took six months doesn't have the same value as a transcript of a traffic court decision already in the public domain, which a website like CanLII didn't even create (a judge did). Publishers that invest millions in original journalism or spend years on documentation and research also won't settle for a flat crawl fee that applies equally to a government form or FAQ page.

Cloudflare's system doesn't account for that nuance. So most AI companies (including my company, Caseway) won't buy in. Why would AI companies pay premium rates for generic content that they can get elsewhere or that they've already ingested from Common Crawl? Or, more importantly, for content that a non-profit website merely hosts on behalf of others?

Meta disclosed that 67% of the training data for its LLaMA 1 model came from Common Crawl, which is raw web content collected without payment or consent. OpenAI's GPT-3 also used hundreds of billions of tokens from Common Crawl. These datasets are massive, free, and already full of scraped content from across the web. Unless you're offering something significantly better, or AI firms are legally required to pay, why would an AI firm suddenly switch to paying by the page?

And that brings us to the second problem.

Enforcement Is a Fantasy

Let's say you're a serious artificial intelligence lab or company. You've seen the lawsuits, and you want to stay compliant. Cloudflare's Pay-Per-Crawl system might help you track access and pay for what you use. But that's not who Cloudflare needs to stop.

The AI companies most likely to abuse your content aren't going to sign up, add a payment method, and politely negotiate crawl rights.
They'll simply spoof their user agent, rotate IP addresses, or use a third-party proxy (maybe in India or China) to obtain the data anyway. And there's nothing Cloudflare can do about it once the traffic appears to come from a human browser or a generic scraper. Will a non-profit like CanLII pursue a company in Shanghai? Good luck convincing a judge in China to care about free court decisions in Canada.

According to Digiday, media companies like Skift saw OpenAI's GPTBot hit their sites over 50,000 times a day despite explicitly disallowing it in their robots.txt files. Ziff Davis (owner of PCMag and Mashable) reported that OpenAI's crawler increased its activity even after being told to stop. And Wikimedia said AI scrapers caused a 50% surge in bandwidth costs this year alone.

So, enforcement depends entirely on good faith. But that's wishful thinking.

Publishers Need Leverage, Not Just Permission

I get why publishers are excited about Pay-Per-Crawl. I've been in this business long enough to see how the value chain's been flipped. I previously ran a lawyer review platform with over 1.1 million lawyers. Traffic, discovery, and reputation used to drive value. However, now AI platforms are building sticky interfaces that pull answers directly from content, eliminating the need for any visitor to return.

Cloudflare's marketplace attempts to address this, but it remains built on the premise that consent and compensation are optional. If AI companies want to train on your data, they'll pay. If not, they won't.

What publishers need isn't a crawler paywall. They need actual leverage, which includes legal clarity, enforceable rules, and collective bargaining power. Some of that might come through the courts, but I doubt it. The pace of litigation is glacial.

More promising are industry coalitions advocating for default protections, such as requiring opt-ins, licensing standards, or even machine-readable "do not train" signals.
There are also startups like Tollbit that enable publishers to detect AI bots and automatically serve them alternate versions of content, or tollgates. These are blunt solutions. However, they shift power back to the people who are actually creating content. That's the right direction.

The Bottom Line

Cloudflare's Pay-Per-Crawl is a clever idea. It's the first genuine attempt to attach a meter to data before it gets swallowed by the AI engine. And for publishers already using Cloudflare, it's a step toward asserting control.

But it won't work at scale. It fails to distinguish between high-value and low-value content. It relies on the honour system for enforcement. And it assumes that some large AI companies, which have trained billion-dollar models on free web data for years, will suddenly start paying for data.

If anything, Pay-Per-Crawl exposes the more profound truth: this fight is about power. This war's just getting started.

This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro

======================================================================
Link to news story:
https://www.techradar.com/pro/cloudflares-pay-per-crawl-is-built-to-fail-heres-why

--- Mystic BBS v1.12 A47 (Linux/64)
 * Origin: tqwNet Technology News (1337:1/100)