[HN Gopher] Devs say AI crawlers dominate traffic, forcing block...
       ___________________________________________________________________
        
       Devs say AI crawlers dominate traffic, forcing blocks on entire
       countries
        
       Author : LinuxBender
       Score  : 35 points
        Date   : 2025-03-25 21:42 UTC (1 hour ago)
        
 (HTM) web link (arstechnica.com)
 (TXT) w3m dump (arstechnica.com)
        
        | ggm wrote:
        | Entire-country blocks are lazy, and pragmatic. The US armed
        | forces at one point blocked AU/NZ on 202/8 and 203/8 over a
        | misunderstanding about packets from China, which also come
        | from those blocks. Not so useful for military staff seconded
        | into the region trying to use the public internet to get back
        | to base.
        | 
        | People need to find better methods. And crawlers need to pay
        | a stupidity tax or be regulated (a dirty word in the tech
        | sector).
        
         | noirscape wrote:
         | They can absolutely work if you aren't expecting any traffic
         | from those countries whatsoever.
         | 
         | I don't expect any international calls... ever, so I block
         | international calling numbers on my phone (since they are
         | _always_ spam calls) and it cuts down on the overwhelming
          | majority of them. Don't see why that couldn't apply to
          | websites either.
        
           | ggm wrote:
            | Sure. Absolutely works. Right up until it doesn't. I think
            | the MIL was the wrong people to assume "we will never need
            | packets from these network blocks".
           | 
            | The other thing is that phone numbers follow a numbering
            | scheme where +1 is North America and +64 is NZ. It's easy
            | to know the long-term geographic consequence of your
            | block, modulo faked-out CLID. IP packets don't follow this
            | logic: Amazon can deploy AWS nodes with IPs acquired in
            | Asia in any DC they like, and the smaller hosting
            | companies don't guarantee that the IP range they route for
            | banks has no pornographers on it.
           | 
            | It's really not sensible to use IP blocks except in very
            | specific cases like yours. "I never terminate
            | international calls" is the NAT of firewalls: if all you
            | want is "no incoming packets from strangers", then sure,
            | the cheapest path is to block entire swathes of IPv4 and
            | IPv6. But if you are in general service delivery, that
            | rarely works. If you ran a business doing trade in China,
            | you'd remove that block immediately.
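            | 
            | To make the bluntness concrete: the whole mechanism is
            | just a CIDR membership test. A minimal Python sketch (the
            | 202/8 and 203/8 prefixes are from the MIL story above;
            | the sample addresses are illustrative):
            | 
            |     import ipaddress
            | 
            |     # Blunt "country" block: drop everything in these
            |     # prefixes, regardless of who actually routes the
            |     # addresses today.
            |     BLOCKED = [ipaddress.ip_network(p)
            |                for p in ("202.0.0.0/8", "203.0.0.0/8")]
            | 
            |     def is_blocked(addr: str) -> bool:
            |         ip = ipaddress.ip_address(addr)
            |         return any(ip in net for net in BLOCKED)
            | 
            |     # The test says nothing about geography: an AWS node,
            |     # a bank, and a scraper can all sit in the same /8.
            |     print(is_blocked("202.12.29.1"))  # True
            |     print(is_blocked("8.8.8.8"))      # False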
        
           | kragen wrote:
           | It depends on whether the information on the website is
           | supposed to be publicly available or not. "This information
           | is publicly available except to people from Israel" sends a
           | really terrible message.
        
       | grotorea wrote:
        | Is this stuff only affecting the not-for-profit web? What are
        | the for-profit sites doing? I haven't seen Anubis around the
        | web elsewhere. Are we just going to get more and tighter login
        | walls and send everything into the deep web?
        
         | surfingdino wrote:
         | I think we killed the old web. We'll see new ways of
         | communicating, publishing, and gathering over the internet.
         | It's sad, but it's also exciting.
        
         | burkaman wrote:
          | For-profit sites are making deals directly with the AI
          | companies so they can get some more of that profit.
        
       | xena wrote:
       | Wow it is so surreal to see a project of mine on Ars Technica!
       | It's such an honor!
        
         | true_blue wrote:
         | On the few sites I've seen using it so far, it's been a more
         | pleasant (and cuter) experience for me than the captchas I'd
          | probably get otherwise. Good work!
        
       | edoloughlin wrote:
       | I'm being trite, but if you can detect an AI bot, why not just
       | serve them random data? At least they'll be sharing some of the
       | pain they inflict.
        
         | xena wrote:
          | You can detect the patterns in aggregate. You can't detect
          | them easily at the individual-request level.
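          | 
          | Roughly, "in aggregate" means something like counting
          | requests per network over a sliding window. A minimal
          | sketch; the /24 grouping, window, and threshold are all
          | illustrative, not what any real deployment is tuned to:
          | 
          |     import ipaddress
          |     import time
          |     from collections import defaultdict, deque
          | 
          |     WINDOW_SECONDS = 60   # illustrative
          |     THRESHOLD = 1000      # illustrative
          | 
          |     # /24 network -> timestamps of its recent requests
          |     hits = defaultdict(deque)
          | 
          |     def record_request(addr: str) -> bool:
          |         """True if this request's /24 now looks like a
          |         crawler in aggregate."""
          |         net = ipaddress.ip_network(f"{addr}/24",
          |                                    strict=False)
          |         now = time.monotonic()
          |         q = hits[net]
          |         q.append(now)
          |         # Drop timestamps that fell out of the window.
          |         while q and now - q[0] > WINDOW_SECONDS:
          |             q.popleft()
          |         return len(q) > THRESHOLD
          | 
          | No single request trips this; only the pattern across many
          | of them does.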
        
         | noirscape wrote:
         | Bandwidth isn't free, not at the volume these crawlers scrape
         | at; serving them random data (for example by leading them down
         | an endless tarpit of links that no human would end up visiting)
         | would still incur bandwidth fees.
         | 
          | Also, it's not identifiable AI bot traffic that's detected
          | (they mask themselves as regular browsers and hop between
          | domestic IP addresses when blocked); it's just really
          | obviously AI scraper traffic in aggregate: mass crawlers
          | have nothing to gain from bringing down their host sites,
          | _except_ AI scrapers.
         | 
          | A search engine has nothing if it brings down the site it's
          | scraping (and has everything to gain from identifying
          | itself as a search engine to try to get favorable request
          | speeds; the only thing it would need to check is whether
          | the site is serving it different data, and that's much
          | cheaper). Same with an archive scraper, and those two are
          | pretty much the main examples I can think of for most
          | scraping traffic.
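          | 
          | For what the tarpit idea looks like in practice: a handler
          | that deterministically spins more links out of every path,
          | so a crawler that follows them never runs out. A minimal
          | standard-library Python sketch (the port and link count are
          | illustrative); note that every one of these responses still
          | costs you bandwidth, which is the whole problem:
          | 
          |     import hashlib
          |     from http.server import (BaseHTTPRequestHandler,
          |                              HTTPServer)
          | 
          |     class Tarpit(BaseHTTPRequestHandler):
          |         def do_GET(self):
          |             # Derive child links from the current path, so
          |             # every page leads to more pages, forever.
          |             links = []
          |             for i in range(5):
          |                 h = hashlib.sha256(
          |                     f"{self.path}/{i}".encode())
          |                 links.append('<a href="/%s">more</a>'
          |                              % h.hexdigest()[:12])
          |             body = ("<html><body>" + "\n".join(links)
          |                     + "</body></html>").encode()
          |             self.send_response(200)
          |             self.send_header("Content-Type", "text/html")
          |             self.send_header("Content-Length",
          |                              str(len(body)))
          |             self.end_headers()
          |             self.wfile.write(body)
          | 
          |     if __name__ == "__main__":
          |         HTTPServer(("127.0.0.1", 8080),
          |                    Tarpit).serve_forever()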
        
           | charcircuit wrote:
           | >Bandwidth isn't free
           | 
           | Via peering agreements it is.
        
         | nosianu wrote:
         | You mean like this?
         | 
         | [2025-03-19] https://blog.cloudflare.com/ai-labyrinth/
         | 
         | > Trapping misbehaving bots in an AI Labyrinth
         | 
         | > _Today, we're excited to announce AI Labyrinth, a new
         | mitigation approach that uses AI-generated content to slow
         | down, confuse, and waste the resources of AI Crawlers and other
         | bots that don't respect "no crawl" directives._
        
       | tedunangst wrote:
       | > It remains unclear why these companies don't adopt more
       | collaborative approaches and, at a minimum, rate-limit their data
       | harvesting runs so they don't overwhelm source websites.
       | 
       | If the target goes down after you scrape it, that's a feature.
        
       ___________________________________________________________________
       (page generated 2025-03-25 23:00 UTC)