[HN Gopher] Devs say AI crawlers dominate traffic, forcing block...
___________________________________________________________________
Devs say AI crawlers dominate traffic, forcing blocks on entire
countries
Author : LinuxBender
Score : 35 points
Date : 2025-03-25 21:42 UTC (1 hour ago)
(HTM) web link (arstechnica.com)
(TXT) w3m dump (arstechnica.com)
| ggm wrote:
| Entire-country blocks are lazy, and pragmatic. The US armed
| forces at one point blocked AU/NZ on 202/8 and 203/8 over a
| misunderstanding about packets from China, which also came from
| these blocks. Not so useful for military staff seconded into the
| region trying to use the public internet to get back to base.
|
| People need to find better methods. And, crawlers need to pay a
| stupidity tax or be regulated (dirty word in the tech sector)
| noirscape wrote:
| They can absolutely work if you aren't expecting any traffic
| from those countries whatsoever.
|
| I don't expect any international calls... ever, so I block
| international calling numbers on my phone (since they are
| _always_ spam calls) and it cuts down on the overwhelming
| majority of them. Don't see why that couldn't apply to
| websites either.
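The country-block approach discussed above can be sketched as a simple membership test over CIDR ranges; the 202/8 and 203/8 blocks are the ones mentioned in the thread, and the function name is an illustrative assumption, not anyone's actual firewall config:

```python
import ipaddress

# The /8s mentioned above (historic APNIC allocations covering
# AU/NZ among others); this list is purely illustrative.
BLOCKED_RANGES = [
    ipaddress.ip_network("202.0.0.0/8"),
    ipaddress.ip_network("203.0.0.0/8"),
]

def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

In practice a real firewall (nftables, cloud WAF, etc.) would do this at the packet level, but the logic is the same coarse range check, which is exactly why it overblocks.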
| ggm wrote:
| Sure. Absolutely works. Right up until it doesn't. I think
| the MIL were the wrong people to assume "we will never need
| packets from these network blocks".
|
| The other thing is that phone numbers follow a numbering
| scheme where +1 is North America and +64 is NZ. It's easy to
| know the long-term geographic consequence of your block,
| modulo faked-out CLID. IP packets don't follow this logic:
| Amazon can deploy AWS nodes with IPs acquired in Asia in any
| DC they like, and the smaller hosting companies don't guarantee
| that the IP ranges they route for banks have no pornographers
| on them.
|
| It's really not sensible to use IP blocks except for very
| specific cases like yours. "I never terminate international
| calls" is the NAT of firewalls: "I don't want incoming
| packets from strangers." Sure, the cheapest path is to block
| entire swathes of IPv4 and IPv6, but if you are in general
| service delivery, that rarely works. If you ran a business
| doing trade in China, you'd remove that block immediately.
| kragen wrote:
| It depends on whether the information on the website is
| supposed to be publicly available or not. "This information
| is publicly available except to people from Israel" sends a
| really terrible message.
| grotorea wrote:
| Is this stuff only affecting the not for profit web? What are the
| for profit sites doing? I haven't seen Anubis around the web
| elsewhere. Are we just going to get more and tighter login walls
| and send everything into the deep web?
| surfingdino wrote:
| I think we killed the old web. We'll see new ways of
| communicating, publishing, and gathering over the internet.
| It's sad, but it's also exciting.
| burkaman wrote:
| For profit sites are making deals directly with the AI
| companies so they can get some more of that profit.
| xena wrote:
| Wow it is so surreal to see a project of mine on Ars Technica!
| It's such an honor!
| true_blue wrote:
| On the few sites I've seen using it so far, it's been a more
| pleasant (and cuter) experience for me than the captchas I'd
| probably get otherwise. good work!
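The project being discussed (Anubis) challenges browsers with proof of work instead of captchas. A minimal sketch of the general technique, in Python rather than browser JavaScript; the hash scheme, difficulty, and function names here are illustrative assumptions, not Anubis's actual protocol:

```python
import hashlib
import itertools

def solve_challenge(challenge: str, difficulty: int = 3) -> int:
    """Client side: find a nonce such that SHA-256(challenge + nonce)
    starts with `difficulty` hex zeros. Cheap for one human visitor,
    expensive at crawler scale."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int = 3) -> bool:
    """Server side: a single hash, regardless of how hard solving was."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the whole point: verification is one hash, solving averages 16^difficulty hashes, so the cost lands on the high-volume scraper rather than the site.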
| edoloughlin wrote:
| I'm being trite, but if you can detect an AI bot, why not just
| serve them random data? At least they'll be sharing some of the
| pain they inflict.
| xena wrote:
| You can detect the patterns in aggregate. You can't detect it
| easily at an individual request level.
| noirscape wrote:
| Bandwidth isn't free, not at the volume these crawlers scrape
| at; serving them random data (for example by leading them down
| an endless tarpit of links that no human would end up visiting)
| would still incur bandwidth fees.
|
| Also it's not identifiable AI bot traffic that's detected (they
| mask themselves as regular browsers and hop between domestic IP
| addresses when blocked), it's just really obviously AI scraper
| traffic in aggregate: other mass crawlers have no benefit from
| bringing down their host sites, _except_ for AI.
|
| A search engine gains nothing if it brings down the site it's
| scraping (and has everything to gain from identifying itself as
| a search engine to try to get favorable request speeds - the
| only thing it'd need to check is that the site in question
| isn't serving it different data, but that's much cheaper); same
| with an archive scraper, and those two are pretty much the main
| examples I can think of for most scraping traffic.
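The aggregate-only detection described above can be sketched as grouping request counts by subnet: each request looks like an ordinary browser on its own, and only the combined volume gives the scraper away. The /24 grouping and the threshold are illustrative assumptions:

```python
import ipaddress
from collections import Counter

def flag_scraper_subnets(request_ips, threshold=1000):
    """Group request counts by /24 subnet and flag subnets whose
    aggregate volume exceeds a threshold. Individual IPs stay under
    any per-IP limit; the subnet total does not."""
    counts = Counter(
        ipaddress.ip_network(f"{ip}/24", strict=False)
        for ip in request_ips
    )
    return {str(net) for net, n in counts.items() if n >= threshold}
```

Real deployments hop across far more address space than a /24 (residential proxy pools span many ASNs), so this is the idea, not a production heuristic.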
| charcircuit wrote:
| >Bandwidth isn't free
|
| Via peering agreements it is.
| nosianu wrote:
| You mean like this?
|
| [2025-03-19] https://blog.cloudflare.com/ai-labyrinth/
|
| > Trapping misbehaving bots in an AI Labyrinth
|
| > _Today, we're excited to announce AI Labyrinth, a new
| mitigation approach that uses AI-generated content to slow
| down, confuse, and waste the resources of AI Crawlers and other
| bots that don't respect "no crawl" directives._
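The tarpit idea quoted above (and the "endless tarpit of links" mentioned earlier in the thread) can be sketched as a page generator whose links only ever lead deeper into the maze; the URL scheme and fanout are illustrative assumptions, not Cloudflare's implementation:

```python
import hashlib

def maze_page(path: str, fanout: int = 3) -> str:
    """Generate a deterministic dead-end page on the fly: every link
    points at another generated maze page, so a crawler that ignores
    "no crawl" directives wanders forever while serving cost stays
    near zero (nothing is stored, pages are derived from the path)."""
    links = []
    for i in range(fanout):
        token = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        links.append(f'<a href="/maze/{token}">page {token}</a>')
    return "<html><body>" + "\n".join(links) + "</body></html>"
```

Deriving pages from a hash of the path keeps the maze stateless, which matters because, as noted above, bandwidth is the cost you are trying to shift, not storage.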
| tedunangst wrote:
| > It remains unclear why these companies don't adopt more
| collaborative approaches and, at a minimum, rate-limit their data
| harvesting runs so they don't overwhelm source websites.
|
| If the target goes down after you scrape it, that's a feature.
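For contrast, the rate limiting the quoted article asks crawlers to adopt is a solved problem; a minimal token-bucket sketch a crawler could apply per host (the class name and parameters are illustrative assumptions):

```python
import time

class TokenBucket:
    """Per-host request limiter: `rate` tokens refill per second up to
    `capacity`, and each outgoing request spends one token. A crawler
    that waits when allow() returns False cannot overwhelm its source."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```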
___________________________________________________________________
(page generated 2025-03-25 23:00 UTC)