[HN Gopher] AI bots are destroying Open Access
___________________________________________________________________
AI bots are destroying Open Access
Author : dhacks
Score : 67 points
Date : 2025-03-25 18:15 UTC (4 hours ago)
(HTM) web link (go-to-hellman.blogspot.com)
(TXT) w3m dump (go-to-hellman.blogspot.com)
| nathanaldensr wrote:
| The only way--the _only_ way--to solve these issues is with web
| servers requiring that all clients authenticate with a credential
| that is provably tied to a real-world entity--person or corporate
| entity--so that legal recourse is available to the server owner
| when abuse occurs. The internet is no longer high-trust; we're
| running web servers the same way we'd run an honor-system store
| where people just come in and steal, anonymously, and with no
| recourse.
| ronsor wrote:
| I guess I'm done using the Internet then.
| batata_frita wrote:
| We're just ruining the last good part of the internet
| eesmith wrote:
| $125 and I can start an LLC.
|
| That's a real-world corporate entity. Recourse ends at the
| "limited liability" in LLC.
|
| Make that LLC owned by another? Offshore ownership? Might take
| a few thousand bucks.
| ronsor wrote:
| Cheaper than that, and the Secretary of State doesn't
| actually verify anything.
| JohnFen wrote:
| The truth and tragedy of this is very clear to me. I am hoping
| this is something that will eventually be solved, but I don't
| expect it. These companies are on a burn-and-pillage rampage.
| Verdex wrote:
| I don't know. Once I know who the legal entity is that I assert
| is a bad actor, I'm not sure there is really any recourse to be
| had.
|
| Your honor, these people are visiting my website in a way that
| makes me sad? I feel that we would need to encode bad behavior
| in a legally reasonable way first.
|
| And not to mention that you'll have to bring legal disputes one
| legal entity at a time. And some of these legal entities have
| very deep pockets.
|
| Unless the suggestion is that internet providers are all going
| to join together to stand up for the little guy? Somehow I'm
| not optimistic.
|
| (Finally, IPv6 has taken decades to get to where it is today.
| Somehow I don't see a legally attributable IP traffic extension
| being ready and deployed any faster.)
| akomtu wrote:
| In reality, users will have to show a passport to use the
| internet, while corporations will hide behind a "Corporate ID"
| that's whitelisted in all authenticator services, because those
| are also corporations. So you'll keep getting millions of
| requests from corp234 and corp456 with no legal recourse
| against them.
| xnx wrote:
| Why are "AI" bots generating so much fuss? Is it because there
| are so many of them? Is it because AI companies are each writing
| their own (bad) crawlers instead of using existing ones?
| jsheard wrote:
| AI bot operators are financially incentivized to not be good
| citizens; they want as much data as possible as fast as
| possible and don't care who they piss off in the process. Plus
| for now at least they have effectively unlimited money to throw
| at bandwidth, storage, IP addresses, crawling with full-blown
| headless browsers, etc.
| quectophoton wrote:
| And it gets worse.
|
| For now they are probably paying to use residential IP
| addresses that they get from other services that sell them
| (and these services get them from people who willingly sell
| some of their bandwidth for cents).
|
| But I think it won't be long before we start seeing the AI
| companies having each _their own_ swarm of residential IP
| addresses by selling _themselves_ a browser extension or
| mobile app, saying something like:
|
| "Get faster results (or a discount) by using our extension!
| By using your own internet connection to fetch the required
| context, you won't need to share computing resources with
| other users, thusly increasing the speed of your queries!
| Plus, since you don't use our servers, that means we can pass
| our savings to you as a discount!"
|
| Then in small print saying they use your connection for
| helping others with their queries, or being more eco-friendly
| because sharing, or whatever they come up with to justify
| this.
| duttonw wrote:
| OpenAI has 'already' got a browser extension. Who knows
| when this is 'enabled'. We already had the 'honey' debacle
| with Amazon/eBay referral link stealing.
| loloquwowndueo wrote:
| This is explained in the article. TL;DR: for whatever reason,
| these AI bots behave nothing like the web crawlers of old. To
| quote TFA:
|
| > The current generation of bots is mindless. They use as many
| connections as you have room for. If you add capacity, they
| just ramp up their requests. They use randomly generated user-
| agent strings. They come from large blocks of IP addresses.
| They get trapped in endless hallways. I observed one bot asking
| for 200,000 nofollow redirect links pointing at Onedrive,
| Google Drive and Dropbox. (which of course didn't work, but
| Onedrive decided to stop serving our Canadian human users).
| They use up server resources - one speaker at Code4lib
| described a bug where software they were running was using 32
| bit integers for session identifiers, and it ran out!
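| The session-identifier bug in the quote can be illustrated with a
| rough sketch. Everything here is an assumption for illustration:
| the quote does not say whether the counter was signed, how it was
| incremented, or how fast sessions were created. The function names
| and the rate of 100 sessions per second are hypothetical.

```python
import ctypes

# Assumption: the identifier is a signed 32-bit counter.
INT32_MAX = 2**31 - 1


def days_until_exhausted(sessions_per_second, max_id=INT32_MAX):
    """Days until a counter starting at zero reaches max_id,
    if every request from a bot opens a fresh session."""
    seconds = max_id / sessions_per_second
    return seconds / 86_400  # seconds per day


def next_session_id(current):
    """Increment with 32-bit wraparound, as a fixed-width integer would."""
    return ctypes.c_int32(current + 1).value


if __name__ == "__main__":
    # Hypothetical rate: 100 new sessions per second.
    print(round(days_until_exhausted(100), 1))
    # One past the maximum wraps to the most negative value.
    print(next_session_id(INT32_MAX))
```

| Under these assumptions a counter that would last decades with
| human traffic is used up in well under a year, which is consistent
| with the kind of failure the speaker described.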
| flakeoil wrote:
| They are maxing out the CPU of the web servers. For example
| Anthropic hitting a server 11 times per second non-stop easily
| loads a basic web server serving a dynamic website. That's like
| 1 million page views per day. And they continue for weeks even
| though they could have scraped whatever they are after in less
| than an hour.
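| As a rough check of the "1 million page views per day" figure,
| assuming the rate of 11 requests per second holds constantly for
| 24 hours (the function name is only for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(rate_per_second):
    """Total requests in one day at a constant per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(requests_per_day(11))  # 950400
```

| That is about 950,000 requests, so "like 1 million" is a fair
| rounding.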
| rcxdude wrote:
| Seems like the latter. There's basically a large number of
| well-funded attempts to crawl the internet, and enough of them
| are badly behaved enough it's basically a DDOS against smaller
| hosts.
| kragen wrote:
| Sheesh, just use BitTorrent. That's what open access licensing is
| for! BitTorrent's tit-for-tat approach limits the harm selfish
| actors can do, only greatly rewarding those whose behavior
| benefits others, and has been shown to be very robust against
| active disruption attempts for decades now. Moreover, it also
| confers some resistance to falsification of the published record,
| to linkrot, and to publishing companies going bankrupt.
|
| Sooner or later we need to take back the legitimate internet from
| surveillance capitalism. Capitalism is great (it shares many of
| BitTorrent's virtues, not coincidentally) but surveillance
| capitalism is not.
| quectophoton wrote:
| As much as I like BitTorrent, people (usually) don't want to
| provide open access to information; what they (usually) want is
| to be an "open" gateway to that information, as long as they
| are the centralized point of distribution whose name appears in
| the URL bar, and as long as they control when they can remove
| access to that information.
|
| Creating a torrent is not showy enough, because the credit is
| "just" another file and/or a comment in the torrent metadata.
|
| Granted, they usually do that because they want to "kindly"
| advertise a way to donate to them (EDIT: or to track you, or
| other similar goals), and there's nothing wrong with trying to
| get donations, but there's clearly a conflict of interest at
| play here.
| kragen wrote:
| It doesn't matter what people _usually_ want. It's
| sufficient for _someone_ to want to torrent the open-access
| articles, even if everyone else is playing the exploitative
| games you're describing. The Berlin Declaration that defined
| "open access" https://openaccess.mpg.de/Berlin-Declaration
| requires specifically
|
| > _The author(s) and right holder(s) of such contributions
| grant(s) to all users a free, irrevocable, worldwide, right
| of access to, and a license to copy, use, distribute,
| transmit and display the work publicly and to make and
| distribute derivative works, in any digital medium for any
| responsible purpose, subject to proper attribution of
| authorship (community standards, will continue to provide the
| mechanism for enforcement of proper attribution and
| responsible use of the published work, as they do now), as
| well as the right to make small numbers of printed copies for
| their personal use._
|
| This guarantees that such torrents are legal unless the
| original authors are infringing copyright.
|
| So there is no danger of AI bots destroying open access.
| zzo38computer wrote:
| I have temporarily disabled my HTTP server for now. (I set up
| port knocking for a day, but I got rid of it due to a kernel
| panic.)
|
| My issue is not to prevent anyone from obtaining a copy if they
| want to, and I want to ensure that users can use curl, Lynx,
| and other programs; I do not want to require JavaScript, CSS,
| Firefox, Google, etc.
|
| My problem is that these LLM scraping bots are badly behaved,
| making many requests and repeating them even though there is no
| good reason to do so, and potentially overloading the servers.
| These things are mentioned in the article. Some bots are not so
| badly behaved, and those are not the problem.
| paulddraper wrote:
| How can you tell they are LLM bots?
| zzo38computer wrote:
| I do not know for sure, but they are accessing with many
| different IP addresses, and with many different user-agent
| values that all include "Mozilla". I had read elsewhere that
| apparently they are botnets for LLM scraping.
| josefritzishere wrote:
| AI is a scourge. It provides next to nothing useful but wreaks
| havoc. As passing fads go, it's heavy on the disruption but light
| on the utility.
| kazinator wrote:
| > _The old style bots were rarely a problem. They respected robot
| exclusions and "nofollow" warnings._
|
| What year are they reminiscing about here, 1999? Nothing has
| respected robots.txt in over twenty years.
|
| Nofollow isn't an anti-bot measure; it's supposed to inform
| search engines that you don't vouch for the linked content (don't
| wish to boost its rank). Nofollow doesn't mean "you must not
| follow this if you're a crawler".
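| For reference, this is roughly what honoring robots.txt looks
| like on the crawler side, using Python's standard
| urllib.robotparser. It is only a sketch: the robots.txt content,
| the "ExampleBot" user agent, and the example.org URLs are made up
| for illustration, and nothing forces a crawler to run a check
| like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://example.org/article/1"))
    print(may_fetch("ExampleBot", "https://example.org/search?q=x"))
    # Requested pause between fetches, in seconds, if the site set one.
    print(parser.crawl_delay("ExampleBot"))
```

| The check is entirely voluntary, which is the point: the exclusion
| file only works against crawlers that choose to consult it.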
___________________________________________________________________
(page generated 2025-03-25 23:02 UTC)