[HN Gopher] AI bots are destroying Open Access
       ___________________________________________________________________
        
       AI bots are destroying Open Access
        
       Author : dhacks
       Score  : 67 points
       Date   : 2025-03-25 18:15 UTC (4 hours ago)
        
 (HTM) web link (go-to-hellman.blogspot.com)
 (TXT) w3m dump (go-to-hellman.blogspot.com)
        
       | nathanaldensr wrote:
       | The only way--the _only_ way--to solve these issues is with web
       | servers requiring that all clients authenticate with a credential
       | that is provably tied to a real-world entity--person or corporate
       | entity--so that legal recourse is available to the server owner
       | when abuse occurs. The internet is no longer high-trust; we're
       | running web servers the same way we'd run an honor-system store
       | where people just come in and steal, anonymously, and with no
       | recourse.
        
         | ronsor wrote:
         | I guess I'm done using the Internet then.
        
         | batata_frita wrote:
         | We're just ruining the last good part of the internet.
        
         | eesmith wrote:
         | $125 and I can start an LLC.
         | 
         | That's a real-world corporate entity. Recourse ends at the
         | "limited liability" in LLC.
         | 
         | Make that LLC owned by another? Offshore ownership? Might take
         | a few thousand bucks.
        
           | ronsor wrote:
           | Cheaper than that, and the Secretary of State doesn't
           | actually verify anything.
        
         | JohnFen wrote:
         | The truth and tragedy of this is very clear to me. I am hoping
         | this is something that will eventually be solved, but I don't
         | expect it. These companies are on a burn-and-pillage rampage.
        
         | Verdex wrote:
         | I don't know. Once I know which legal entity I assert is a bad
         | actor, I'm not sure there is really any recourse to be had.
         | 
         | Your honor, these people are visiting my website in a way that
         | makes me sad? I feel that we would need to encode bad behavior
         | in a legally reasonable way first.
         | 
         | Not to mention that you'll have to bring legal disputes against
         | one legal entity at a time. And some of these legal entities
         | have very deep pockets.
         | 
         | Unless the suggestion is that internet providers are all going
         | to join together to stand up for the little guy? Somehow I'm
         | not optimistic.
         | 
         | (Finally, IPv6 has taken decades to get to where it is today.
         | Somehow I don't see a legally attributable IP traffic extension
         | being ready and deployed any faster.)
        
         | akomtu wrote:
         | In reality, users will have to show a passport to use the
         | internet, while corporations will hide behind a "Corporate ID"
         | that's whitelisted in all authenticator services, because those
         | are also corporations. So you'll keep getting millions of
         | requests from corp234 and corp456 with no legal recourse
         | against them.
        
       | xnx wrote:
       | Why are "AI" bots generating so much fuss? Is it because there
       | are so many of them? Is it because AI companies are each writing
       | their own (bad) crawlers instead of using existing ones?
        
         | jsheard wrote:
         | AI bot operators are financially incentivized to not be good
         | citizens; they want as much data as possible as fast as
         | possible and don't care who they piss off in the process. Plus
         | for now at least they have effectively unlimited money to throw
         | at bandwidth, storage, IP addresses, crawling with full-blown
         | headless browsers, etc.
        
           | quectophoton wrote:
           | And it gets worse.
           | 
           | For now they are probably paying to use residential IP
           | addresses that they get from other services that sell them
           | (and these services get them from people who willingly sell
           | some of their bandwidth for cents).
           | 
           | But I think it won't be long before we start seeing the AI
           | companies each having _their own_ swarm of residential IP
           | addresses by distributing a browser extension or mobile app
           | _themselves_, saying something like:
           | 
           | "Get faster results (or a discount) by using our extension!
           | By using your own internet connection to fetch the required
           | context, you won't need to share computing resources with
           | other users, thusly increasing the speed of your queries!
           | Plus, since you don't use our servers, that means we can pass
           | our savings to you as a discount!"
           | 
           | Then, in small print, they say they use your connection to
           | help others with their queries, or that it is more
           | eco-friendly because of sharing, or whatever they come up with
           | to justify this.
        
             | duttonw wrote:
             | OpenAI has 'already' got a browser extension. Who knows
             | when this gets 'enabled'? We already had the 'Honey'
             | debacle with Amazon/eBay referral link stealing.
        
         | loloquwowndueo wrote:
         | This is explained in the article. Tl;dr: for whatever reason
         | these AI bots behave nothing like the web crawlers of old. To
         | quote TFA:
         | 
         | > The current generation of bots is mindless. They use as many
         | connections as you have room for. If you add capacity, they
         | just ramp up their requests. They use randomly generated user-
         | agent strings. They come from large blocks of IP addresses.
         | They get trapped in endless hallways. I observed one bot asking
         | for 200,000 nofollow redirect links pointing at Onedrive,
         | Google Drive and Dropbox. (which of course didn't work, but
         | Onedrive decided to stop serving our Canadian human users).
         | They use up server resources - one speaker at Code4lib
         | described a bug where software they were running was using 32
         | bit integers for session identifiers, and it ran out!
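A rough sketch of why a 32-bit session identifier can run out under bot traffic. It assumes, hypothetically, that the IDs are a sequential counter and that every cookieless request opens a new session; the quoted anecdote specifies neither, and the rates below are illustrative rather than measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_exhausted(new_sessions_per_second: float,
                         bits: int = 32, signed: bool = True) -> float:
    """Days until a sequential ID counter of the given width overflows.

    Hypothetical model: one new ID per session, IDs never reused.
    """
    max_id = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return max_id / (new_sessions_per_second * SECONDS_PER_DAY)


# Illustrative rates only; none of these come from the article.
for rate in (10, 100, 1000):
    print(f"{rate:>5} sessions/s: {days_until_exhausted(rate):8.1f} days")
```

Under this model the counter lasts years at human-scale traffic but only weeks once bots open on the order of a thousand sessions per second.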
        
         | flakeoil wrote:
         | They are maxing out the CPU of the web servers. For example
         | Anthropic hitting a server 11 times per second non-stop easily
         | loads a basic web server serving a dynamic website. That's like
         | 1 million page views per day. And they continue for weeks even
         | though they could have scraped whatever they are after in less
         | than an hour.
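The "1 million page views per day" figure can be checked with a line of arithmetic. This minimal sketch assumes a perfectly constant request rate; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR  # 86,400


def total_requests(requests_per_second: float, seconds: int) -> int:
    """Requests made over a period at a constant rate (an idealization)."""
    return round(requests_per_second * seconds)


per_day = total_requests(11, SECONDS_PER_DAY)
per_hour = total_requests(11, SECONDS_PER_HOUR)
print(per_day)   # 950400, i.e. roughly one million per day
print(per_hour)  # 39600 pages fetched in a single hour at the same rate
```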
        
         | rcxdude wrote:
         | Seems like the latter. There's basically a large number of
         | well-funded attempts to crawl the internet, and enough of them
         | are badly behaved that it amounts to a DDoS against smaller
         | hosts.
        
       | kragen wrote:
       | Sheesh, just use BitTorrent. That's what open access licensing is
       | for! BitTorrent's tit-for-tat approach limits the harm selfish
       | actors can do, only greatly rewarding those whose behavior
       | benefits others, and has been shown to be very robust against
       | active disruption attempts for decades now. Moreover, it also
       | confers some resistance to falsification of the published record,
       | to linkrot, and to publishing companies going bankrupt.
       | 
       | Sooner or later we need to take back the legitimate internet from
       | surveillance capitalism. Capitalism is great (it shares many of
       | BitTorrent's virtues, not coincidentally) but surveillance
       | capitalism is not.
        
         | quectophoton wrote:
         | As much as I like BitTorrent, people (usually) don't want to
         | provide open access to information; what they (usually) want is
         | to be an "open" gateway to that information, as long as they
         | are the centralized point of distribution whose name appears in
         | the URL bar, and as long as they control when they can remove
         | access to that information.
         | 
         | Creating a torrent is not showy enough, because the credit is
         | "just" another file and/or a comment in the torrent metadata.
         | 
         | Granted, they usually do that because they want to "kindly"
         | advertise a way to donate to them (EDIT: or to track you, or
         | other similar goals), and there's nothing wrong with trying to
         | get donations, but there's clearly a conflict of interest at
         | play here.
        
           | kragen wrote:
           | It doesn't matter what people _usually_ want. It's
           | sufficient for _someone_ to want to torrent the open-access
           | articles, even if everyone else is playing the exploitative
           | games you're describing. The Berlin Declaration that defined
           | "open access" https://openaccess.mpg.de/Berlin-Declaration
           | requires specifically
           | 
           | > _The author(s) and right holder(s) of such contributions
           | grant(s) to all users a free, irrevocable, worldwide, right
           | of access to, and a license to copy, use, distribute,
           | transmit and display the work publicly and to make and
           | distribute derivative works, in any digital medium for any
           | responsible purpose, subject to proper attribution of
           | authorship (community standards, will continue to provide the
           | mechanism for enforcement of proper attribution and
           | responsible use of the published work, as they do now), as
           | well as the right to make small numbers of printed copies for
           | their personal use._
           | 
           | This guarantees that such torrents are legal unless the
           | original authors are infringing copyright.
           | 
           | So there is no danger of AI bots destroying open access.
        
       | zzo38computer wrote:
       | I have temporarily disabled my HTTP server. (I set up
       | port knocking for a day, but I got rid of it due to a kernel
       | panic.)
       | 
       | My goal is not to prevent anyone from obtaining a copy if they
       | want one; I want to ensure that users can use curl, Lynx, and
       | other programs, and I do not want to require JavaScript, CSS,
       | Firefox, Google, etc.
       | 
       | My problem is that these LLM scraping bots are badly behaved,
       | making many requests and repeating them even though there is no
       | good reason to do so, and potentially overloading the servers.
       | These things are mentioned in the article. Some bots are not so
       | badly behaved, and those are not the problem.
        
         | paulddraper wrote:
         | How can you tell they are LLM bots?
        
           | zzo38computer wrote:
           | I do not know for sure, but they are accessing with many
           | different IP addresses, and with many different user-agent
           | values that all include "Mozilla". I had read elsewhere that
           | apparently they are botnets for LLM scraping.
        
       | josefritzishere wrote:
       | AI is a scourge. It provides next to nothing useful but wreaks
       | havoc. As passing fads go, it's heavy on the disruption but light
       | on the utility.
        
       | kazinator wrote:
       | > _The old style bots were rarely a problem. They respected robot
       | exclusions and "nofollow" warnings._
       | 
       | What year are they reminiscing about here, 1999? Nothing has
       | respected robots.txt in over twenty years.
       | 
       | Nofollow isn't an anti-bot measure; it's supposed to inform
       | search engines that you don't vouch for the linked content (don't
       | wish to boost its rank). Nofollow doesn't mean "you must not
       | follow this if you're a crawler".
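For reference, this is roughly what honoring robot exclusions looks like on the crawler side, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.org URLs are invented for illustration. The check is purely voluntary, so it says nothing about whether a given crawler actually performs it, and `rel="nofollow"` is a separate link-level hint that this does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules without fetching anything


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """True if the parsed rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.org/search?q=open+access"))  # False
print(may_fetch("https://example.org/articles/1"))            # True
```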
        
       ___________________________________________________________________
       (page generated 2025-03-25 23:02 UTC)