Post AshFu1SyKH4CcAWbVA by cmccullough@discuss.systems
 (DIR) Post #Ash6raYuU8l10R4DAG by jimsalter@fosstodon.org
       2025-04-02T22:25:41Z
       
       0 likes, 1 repeats
       
       You may now add me to the list of FOSS folks directly impacted by unethical AI scraping effectively performing Denial of Service attacks. My wife informed me this morning that our billing system had been knocked offline. The reason? #Amazon bot scraping traffic blew up my access logs to the point of filling that server's entire drive. They're constantly scraping and re-scraping my FreeBSD wiki. https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
       
 (DIR) Post #Ash9NQUjeOPQxRQJ4y by jimsalter@fosstodon.org
       2025-04-02T22:53:52Z
       
       0 likes, 0 repeats
       
       In January 2023, freebsdwiki.net--which has not been actively updated for more than a decade--received 812,706 lines in its httpd-access.log. In January 2024, that was up to 2,690,389 lines. As of January 2025, it was up to 7,607,518 lines. And last month--March 2025--it was up to an eye-watering 18,168,053 lines.
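
       To put those counts in proportion, here is a minimal sketch that divides each month's figure by the January 2023 baseline. The numbers are the ones quoted in the post above; the names LOG_LINES and growth_factors are made up for illustration.

```python
# Monthly line counts of httpd-access.log, as quoted in the post above.
LOG_LINES = {
    "2023-01": 812_706,
    "2024-01": 2_690_389,
    "2025-01": 7_607_518,
    "2025-03": 18_168_053,
}


def growth_factors(counts):
    """Return each month's line count divided by the first month's count."""
    baseline = next(iter(counts.values()))
    return {month: lines / baseline for month, lines in counts.items()}


if __name__ == "__main__":
    for month, factor in growth_factors(LOG_LINES).items():
        print(f"{month}: {factor:.1f}x the January 2023 volume")
```

       By this measure, March 2025 comes to a bit over 22 times the January 2023 log volume.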
       
 (DIR) Post #AshA0R95AHPsGmtffc by bobthcowboy@fosstodon.org
       2025-04-02T23:00:53Z
       
       0 likes, 0 repeats
       
       @jimsalter Are there explanations for *why* the scrapers seem to want to re-scrape so often?  Also why does it feel like this is only being discussed on our side of this fence?  Maybe I've missed some article where Facebook employees are explaining why they're re-indexing so often, or how they're working on building a better scraper?
       
 (DIR) Post #AshAEO1tXwdQezBe3k by Prozak@corteximplant.com
       2025-04-02T23:03:23Z
       
       0 likes, 0 repeats
       
       @jimsalter wonder if there is a way to add views-based Google AdSense or something to capitalize on these scrapes
       
 (DIR) Post #AshAHMMqBOieNEOyFk by stragu@mastodon.indie.host
       2025-04-02T23:03:55Z
       
       0 likes, 0 repeats
       
       @jimsalter "According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies' bots, dramatically increasing bandwidth costs, service instability, and burdening already stretched-thin maintainers." Wild.
       
 (DIR) Post #AshBB2K1tjaYQ9dT84 by jimsalter@fosstodon.org
       2025-04-02T23:14:02Z
       
       0 likes, 0 repeats
       
       Here's a breakdown of user-agent strings seen in March 2025. The top two user agents are Amazonbot and Gptbot. Between the two of them, they account for 35% of all traffic to the site. That doesn't sound as bad as the chart makes it look... but even the chart doesn't capture the full story.
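
       The post does not say which tool produced the breakdown. As a rough sketch of how such a tally could be reproduced, the following assumes the access log uses the common "combined" format, where the user agent is the last double-quoted field. The sample lines, the agent names ExampleBot and OtherBot, and the function names are invented for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, so the user agent is the last quoted field.
# A user agent containing escaped quotes would not be parsed correctly.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Tally requests per user-agent string."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def traffic_share(counts, needle):
    """Fraction of requests whose user agent contains `needle` (case-insensitive)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for ua, n in counts.items() if needle.lower() in ua.lower())
    return hits / total


# Made-up sample lines using documentation IP ranges, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [01/Mar/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [01/Mar/2025:00:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.3 - - [01/Mar/2025:00:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/Mar/2025:00:00:03 +0000] "GET /d HTTP/1.1" 200 512 "-" "OtherBot/2.0"',
]

if __name__ == "__main__":
    tally = count_user_agents(SAMPLE)
    for agent, hits in tally.most_common():
        print(f"{hits:6d}  {agent}")
    print(f"ExampleBot share: {traffic_share(tally, 'examplebot'):.0%}")
```

       On a real server, the list would be replaced by the lines of the (possibly gzipped) access log.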
       
 (DIR) Post #AshBiffkcCpm1IpkxM by jimsalter@fosstodon.org
       2025-04-02T23:20:07Z
       
       0 likes, 0 repeats
       
       THIS is what Amazon's scraper traffic looks like: a never ending series of RecentChanges pulls, several times a second, from multiple IP addresses. This is an INSANELY difficult load to manage, because it isn't really cacheable and hits the database HARD.
       
 (DIR) Post #AshCJPPu13hkJowGWG by jimsalter@fosstodon.org
       2025-04-02T23:26:45Z
       
       0 likes, 0 repeats
       
       @bobthcowboy so, you're wondering why the supervillains aren't monologuing? I mean, they're NOT GOOD PEOPLE. Good people don't behave like this. There's really no doubt to extend any benefit from, here. If you really want to listen to supervillain monologues, you can absolutely find plenty of video of the likes of Sam Altman declaring that anything being done in pursuit of AI is worth it, no matter the cost, period.
       
 (DIR) Post #AshCUJqB5YOED4QnKK by jimsalter@fosstodon.org
       2025-04-02T23:28:41Z
       
       0 likes, 0 repeats
       
       I hope you didn't expect to see different tactics from OpenAI's gptbot. It's not rotating IP addresses... but I betcha if I block 20.171.207.122 today, tomorrow I see a giant pool of IP addresses behind gptbot just like the one I see now behind amazonbot.
       
 (DIR) Post #AshCYW6TRBudOu3TpA by dan@infosec.exchange
       2025-04-02T23:29:26Z
       
       0 likes, 0 repeats
       
       @jimsalter do they ignore robots.txt?
       
 (DIR) Post #AshD7B4fcCMsJcjsrw by jimsalter@fosstodon.org
       2025-04-02T23:35:45Z
       
       0 likes, 0 repeats
       
       Let's talk about the next few entries on the list:
       Semrushbot: this is for an "SEO index", it's been a plague on the net for years.
       MJ12bot is for a UK-based engine called "Majestic" which maps links, ignoring actual content. Another very long term, prolific abuser, dwarfed by the current AI scraping.
       Petalbot: this one is for "Petal search," which is exclusively used on Huawei phones. Doesn't even have a website!
       Barkrowler is another SEO engine focused on links, not content.
       
 (DIR) Post #AshEvYzd3mbCvnyFPc by jimsalter@fosstodon.org
       2025-04-02T23:56:04Z
       
       0 likes, 0 repeats
       
       After that, we've got several more bots, several demonstrably fake lookalike "chrome" or "safari" agents, and even the first several human-LOOKING user agents are obviously fake... unless you believe, for example, that an elderly version of Microsoft Edge has a greater percent of the traffic share than ALL versions of Safari, Chrome, and Firefox combined. Bytespider is notable here for pretending it's "mobile Safari" running on Android. Sure Jan.
       
 (DIR) Post #AshF1RXveyJnZaomPY by bobthcowboy@fosstodon.org
       2025-04-02T23:57:07Z
       
       0 likes, 0 repeats
       
       @jimsalter No, you're right.  I'm often involved in hiring in my roles and I actually decided ~10 years ago that I would never hire someone who worked at Facebook at that time or later.  And like, there's people right here on Fosstodon who still work for Big Tech, despite the general state of that scene.  I don't get it and I try to understand what the motivation is aside from greed or naivety.
       
 (DIR) Post #AshFmZm8shtAhpwOEy by jimsalter@fosstodon.org
       2025-04-03T00:05:39Z
       
       0 likes, 0 repeats
       
       Anyway, here's another graphical view into the problem: look at what happened in December 2024. The "unique visitors" stat went up by an order of magnitude between Nov 24 and Dec 24... and not only has it not dropped since, it's gotten actively WORSE.
       
 (DIR) Post #AshFu1SyKH4CcAWbVA by cmccullough@discuss.systems
       2025-04-03T00:06:55Z
       
       0 likes, 0 repeats
       
       @jimsalter Wow!
       
 (DIR) Post #AshG2oc1UD8VM3x1m4 by jimsalter@fosstodon.org
       2025-04-03T00:08:34Z
       
       0 likes, 0 repeats
       
       Not only has the "unique visitors" stat stayed in the six figures level, the *bandwidth* has gone up from typically around 700MiB per month to a whopping TWENTY-SIX GiB per month. If you're wondering why the stats for February show "zero" visitors and uniques, well, that stats run fired off just prior to the cron job that COMPRESSES all the logfiles--so I'm pretty sure the script failed ENOSPC there, recovering for a little while once the last few humongous logs were gzipped.
       
 (DIR) Post #AshGCbtNdMgmnfjxbM by jimsalter@fosstodon.org
       2025-04-03T00:10:21Z
       
       0 likes, 0 repeats
       
       @bobthcowboy I won't say I wouldn't hire anybody who worked for Facebook or, say, Microsoft... there was a period in my life when I was actively being recruited by those orgs, and it was DIFFICULT soul-searching figuring out whether the best thing to do was say hell no, or to say yes and try to change the orgs from within. I would definitely be probing such candidates on morals and ethics during an interview, though.
       
 (DIR) Post #AshGJ6pPWlyAU3yyrg by jimsalter@fosstodon.org
       2025-04-03T00:11:25Z
       
       0 likes, 0 repeats
       
       @Rairii sure is. This is a mediawiki site, and if you believe AI crawlers don't know EXACTLY what a mediawiki is and hunger SPECIFICALLY for its data... welp.
       
 (DIR) Post #AshHC0SOuuA2eROfaq by jimsalter@fosstodon.org
       2025-04-03T00:21:26Z
       
       0 likes, 0 repeats
       
       Are the scrapers hitting my tech blog also? You'd better believe it--but, lacking the tooling MediaWiki offers to show you just what's changed, they aren't scraping anywhere NEAR so frequently. After all, they don't want to burn their OWN processing time actually parsing things before they ingest into the latest model, now do they...?
       
 (DIR) Post #AshHvpV7z1PlzbWSm0 by jimsalter@fosstodon.org
       2025-04-03T00:29:44Z
       
       0 likes, 1 repeats
       
       This does raise an interesting question, though... just what the fuck does Amazonbot think it's doing HERE?
       
 (DIR) Post #AshJv1B1HthDp4rFqq by spineless_echidna@infosec.exchange
       2025-04-03T00:51:58Z
       
       0 likes, 0 repeats
       
       @jimsalter I've got my popcorn out. At this stage I resorted to using Fail2Ban for bots that transgress where robots.txt says they shouldn't go, but the lists can get so long that it starts to impact performance.
       
 (DIR) Post #AshJwnoVwSeExQBLgO by jimsalter@fosstodon.org
       2025-04-03T00:52:20Z
       
       0 likes, 0 repeats
       
       @spineless_echidna this is one of many, many reasons I am constantly advising people to avoid fail2ban. :)
       
 (DIR) Post #AshK1BPm8UTdJShQsy by jimsalter@fosstodon.org
       2025-04-03T00:52:58Z
       
       0 likes, 0 repeats
       
       @rubenerd @Rairii the scrapers haven't found another wiki located on the same server yet. Can't wait until the problem doubles itself. Yay.
       
 (DIR) Post #AshKBU3XgNNK1YDCRk by spineless_echidna@infosec.exchange
       2025-04-03T00:54:57Z
       
       0 likes, 0 repeats
       
       @jimsalter is there a decent alternative? I'm starting to consider rolling out nepenthes or other tarpits in a parallel VM, but I wonder at what cost. I guess I could re-write the URL on a per-user-agent basis and just re-direct them to the tarpit. If they're smart enough to detect it and blacklist my genuine addresses, it's still a win.
       
 (DIR) Post #AshKWHGCS6OwQ530dM by jimsalter@fosstodon.org
       2025-04-03T00:58:43Z
       
       0 likes, 1 repeats
       
       @spineless_echidna I'm also strongly considering Nepenthes or similar. I'd really, really like to actively POISON those bots with false data, not just hamper them with a tarpitted workload. You want to train your models? Step right the fuck up, have I ever got some training data for YOU...
       
 (DIR) Post #AshLJtACJlJ0ADteKm by autolycus@fosstodon.org
       2025-04-03T01:07:40Z
       
       0 likes, 0 repeats
       
       @jimsalter When I told gptbot to piss off in my robots.txt the next day I suddenly started receiving massive (and roughly equal) traffic from supposed Firefox and Chrome users who can apparently read as fast as a bot and know which files in my CMS to look for that aren't public facing. Cloudflare is helping some, but it's like playing a game of whack-a-mole. We need legislation to make robots.txt legally enforceable. It should be a no trespassing sign.
       
 (DIR) Post #AshLr00bt6Ucx0CTuy by autolycus@fosstodon.org
       2025-04-03T01:13:35Z
       
       0 likes, 0 repeats
       
       @jimsalter @spineless_echidna They are truly a scourge to the internet as a whole. People focus on the energy consumption of the GPUs, but the constant bot strain on websites everywhere causes everyone to use more energy and bandwidth.
       
 (DIR) Post #AshLs0mGW1S5ckyPo0 by matt_garber@mastodon.sdf.org
       2025-04-03T01:13:30Z
       
       0 likes, 0 repeats
       
       @jimsalter @spineless_echidna While realizing it’s not perfect for the smaller percentage that lie with old browser versions, I’ve found it *extremely* effective to have a regex block of ~2-3 dozen of the worst behaving offenders that provide no value to me — classic annoyances like MJ12bot and Turnitinbot, to newer Bytespider, GPTBot, PerplexityBot, etc. — that is imported for all my sites in nginx & instantly returns 403 for all reqs. Fast and effective for the largest, low-hanging fruit.
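
       The post describes doing this in nginx and does not show the config. The snippet below is only a Python sketch of the same matching logic (a case-insensitive regex over the User-Agent header that maps to a 403), using the five crawler names mentioned above. The function name status_for is hypothetical, and a real list would be longer and site-specific.

```python
import re

# Names taken from the post above; anything beyond these is site-specific.
BLOCKED_AGENTS = ("MJ12bot", "Turnitinbot", "Bytespider", "GPTBot", "PerplexityBot")

# One alternation, compiled once, matched case-insensitively anywhere in the header.
BLOCKED_RE = re.compile(
    "|".join(re.escape(name) for name in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def status_for(user_agent):
    """Return 403 for a blocked crawler's user agent, otherwise 200."""
    if user_agent and BLOCKED_RE.search(user_agent):
        return 403
    return 200


if __name__ == "__main__":
    for agent in ("Mozilla/5.0 (compatible; GPTBot/1.0)", "Mozilla/5.0 Firefox/136.0"):
        print(status_for(agent), agent)
```

       As the replies note, this only catches crawlers that identify themselves honestly.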
       
 (DIR) Post #AshMNPnfX2RqCvbWnw by jimsalter@fosstodon.org
       2025-04-03T01:19:31Z
       
       0 likes, 0 repeats
       
       @matt_garber @spineless_echidna OpenAI and Amazon have both already been caught lying about their user agents and using residential IP proxies (which begs the question: how are they getting ACCESS to pools of residential IP addresses?) when blocked. I'm looking into deliberately poisoning their models with trash data, personally. You want my CPU cycles and bandwidth? You got 'em, hope you enjoy 'em...
       
 (DIR) Post #AshMWxKOFImenYSVMG by Tubsta@soc.feditime.com
       2025-04-03T01:08:04.767822Z
       
       0 likes, 0 repeats
       
       @jimsalter @spineless_echidna Please add to 2.5 Admins content 🍿
       
 (DIR) Post #AshMWxzrl5xcsBrd0S by oxyhyxo@mastodon.bsd.cafe
       2025-04-03T01:18:10Z
       
       0 likes, 0 repeats
       
       @Tubsta @jimsalter @spineless_echidna pissing off the people who know how to do things is always a winning strategy. *Grabs Popcorn*
       
 (DIR) Post #AshMWydvMA0GsQbcRc by jimsalter@fosstodon.org
       2025-04-03T01:21:08Z
       
       0 likes, 0 repeats
       
       @oxyhyxo @Tubsta @spineless_echidna might be interesting to run one of those horrible spammer tools that tries to bypass blocks by using a thesaurus on every third word, against some readily-available public domain text that can be ethically sourced. I'm trying to think of cheap ways to poison the scraped data, too subtly to detect until it makes it into the actual model.
       
 (DIR) Post #AshMkNXy4k0LsbD624 by jimsalter@fosstodon.org
       2025-04-03T01:23:38Z
       
       0 likes, 0 repeats
       
       @feoh absolutely, if you can manage it. I'm guessing it's following an anchor tag it found on some script kiddie's site, from some attempt said skiddie was making to exploit something along the lines of log4j or similar, but at this point, who the hell knows for sure?
       
 (DIR) Post #AshMm3xlrIRJik5Agq by oxyhyxo@mastodon.bsd.cafe
       2025-04-03T01:23:57Z
       
       0 likes, 0 repeats
       
       @jimsalter @Tubsta @spineless_echidna thesaurus but 2 or 3 definitions deep. Swap the odd noun for a verb.
       
 (DIR) Post #AshMoOdvwRjCRCusOu by jimsalter@fosstodon.org
       2025-04-03T01:24:25Z
       
       0 likes, 0 repeats
       
       @oxyhyxo @Tubsta @spineless_echidna you're definitely picking up what I'm putting down.
       
 (DIR) Post #AshMsovSAKY36tDjkW by oxyhyxo@mastodon.bsd.cafe
       2025-04-03T01:24:34Z
       
       0 likes, 0 repeats
       
       @jimsalter @Tubsta @spineless_echidna aka put whatever they're scraping through a grammatical woodchipper
       
 (DIR) Post #AshMspMOYA6CSRpFk9 by jimsalter@fosstodon.org
       2025-04-03T01:25:10Z
       
       0 likes, 0 repeats
       
       @oxyhyxo @Tubsta @spineless_echidna I wonder how hard it would be to just proxy a scraper direct to output from a hosted ChatGPT or similar instance... "incest" in the training data is ROUGH on LLMs.
       
 (DIR) Post #AshMwuTZTJARTy83YO by j_angliss@fosstodon.org
       2025-04-03T01:25:56Z
       
       0 likes, 0 repeats
       
       @jimsalter weird referrals? newdumpspdf looks like some exam dump website. Wonder if you had some post that looked like it was certificate or training related and ended up scraped by some weird website
       
 (DIR) Post #AshN6ECXVEytGWbcFk by jimsalter@fosstodon.org
       2025-04-03T01:27:37Z
       
       0 likes, 0 repeats
       
       @oxyhyxo @Tubsta @spineless_echidna I'm not looking to put it through a grammatical woodchipper so much as subtly change the actual MEANING, in ways that are less likely to be detected before they're ingested and potentially do major damage to the model itself. Models interpret words as vectors. Consider an engineering problem, in which you only "slightly" modify the vectors of a portion of the moving parts inside a machine...
       
 (DIR) Post #AshNB4qdhnJPhlfsPI by oxyhyxo@mastodon.bsd.cafe
       2025-04-03T01:28:27Z
       
       0 likes, 0 repeats
       
       @jimsalter @Tubsta @spineless_echidna yeah the prompt can be their useragent string plus a dozen random dictionary words -"give me 1000 word explication on ${dictionary words, useragent_string}" > ${http_response}
       
 (DIR) Post #AshND2Ju4jcIRZISJs by durchaus@mastodon.social
       2025-04-03T01:28:45Z
       
       0 likes, 0 repeats
       
       @jimsalter @bobthcowboy these orgs are too big to change anything from within, at least when you're just some engineer.
       
 (DIR) Post #AshNPWtoD0Ecl433k8 by oxyhyxo@mastodon.bsd.cafe
       2025-04-03T01:31:06Z
       
       0 likes, 0 repeats
       
       @jimsalter @Tubsta @spineless_echidna added points if you can get it accepted as a GSOC project ;)
       
 (DIR) Post #AshNavGZx5wRW88DFg by matt_garber@mastodon.sdf.org
       2025-04-03T01:33:04Z
       
       0 likes, 0 repeats
       
       @jimsalter @spineless_echidna Yeah, that’s fair enough, and the poisoned dataset methods are enticing, too. FWIW, from the sample of sites I block the UAs for, some fairly high traffic, I haven’t yet seen a *commensurate* offsetting increase in bogus UAs vs. all the requests I still see coming in from their “genuine” crawler identifier (getting denied), so that’s still my primary method, and it doesn’t rely on IPs at all. (I also rate limit everything dynamic to keep reqs/s reasonable.)
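
       The post mentions rate limiting dynamic requests without saying how it is done. One common approach is a token bucket, sketched here as an assumption rather than as the poster's setup. The class name and the rate and burst values are illustrative, and time is passed in explicitly so the behaviour is easy to check.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if a request may proceed at time `now` (in seconds)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, burst=3, now=0.0)
    for t in (0.0, 0.0, 0.0, 0.0, 0.5):
        print(t, bucket.allow(now=t))
```

       In practice there would be one bucket per client key (an IP address or user agent, say), and a refused request would get a 429 or 503 response.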
       
 (DIR) Post #AshNuK1y6S3XpBY1zs by durchaus@mastodon.social
       2025-04-03T01:36:40Z
       
       0 likes, 0 repeats
       
       @jimsalter first, I'm wondering that the scraper bots are so honest w.r.t. their user agent string. This way, it should be very easy to just block them based on their user agent. Second, why do they have to update so often? Google and other search engines don't scrape as much, right? Scraping data for AI isn't actually that much different from search engine scraping, neither in content nor frequency. So why?
       
 (DIR) Post #AshOIJhWypqrkhQijA by jimsalter@fosstodon.org
       2025-04-03T01:41:01Z
       
       0 likes, 0 repeats
       
       @durchaus they're only "honest" until blocked, either via robots.txt or more direct tactics (like blackholing IP ranges). Once the gig is demonstrably up, they hide the UA, use residential IP addresses as proxies, and so on.
       
 (DIR) Post #AshOOs1mGfHjx41ceW by jimsalter@fosstodon.org
       2025-04-03T01:42:12Z
       
       0 likes, 0 repeats
       
       @durchaus @bobthcowboy I came to the same conclusion, which is why I never went that route even when headhunted. Once I got more overtly political, the headhunting attempts largely stopped. I'm okay with THAT, too.
       
 (DIR) Post #AshVS57MkrGD687LG4 by opticron@eat.fruits.social
       2025-04-03T01:46:54.859559Z
       
       0 likes, 0 repeats
       
       @kevin @dan @jimsalter the RTEMS project is having similar problems on its gitlab instance
       
 (DIR) Post #AshVS69ssyrwKEIrp2 by jimsalter@fosstodon.org
       2025-04-03T03:00:55Z
       
       0 likes, 0 repeats
       
       @opticron @dan @kevin it strikes me that one might feed an LLM ***extremely*** malicious "training data", when said LLM is scraping Git instances looking for examples of how useful code is validly constructed.
       
 (DIR) Post #AshsGi8g9BH7agHKaW by pertho@mastodon.bsd.cafe
       2025-04-03T07:16:50Z
       
       0 likes, 0 repeats
       
       @jimsalter Welcome to my world. We've blocked the legit AI bot user-agents only to find the sleazy Alibaba LLC/TenCent/Huawei bots slamming the sites with older Chrome agents (120 or older)
       
 (DIR) Post #AsiAJ2lMEmS7JBdG7s by hamoid@genart.social
       2025-04-03T10:38:57Z
       
       0 likes, 0 repeats
       
       @jimsalter @hallo Have you at uberspace noticed this on your servers?
       
 (DIR) Post #AsiUcn9JzneRl7XF20 by jimsalter@fosstodon.org
       2025-04-03T14:26:38Z
       
       0 likes, 0 repeats
       
       @kevin @opticron @dan AFAICT, anubis merely makes the operations expensive. I don't want to make the operations expensive, I want to poison the actual *model* those operations feed.
       
 (DIR) Post #AszJ82hCcj8JchuSuG by elrey741@infosec.exchange
       2025-04-11T17:07:55Z
       
       0 likes, 1 repeats
       
       @jimsalter @durchaus I know you were postulating about how they gained access to the residential IPs in https://2.5admins.com/2-5-admins-242/. I've actually seen scraping services advertise residential proxies before, like this one: https://brightdata.com/proxy-types/residential-proxies. So, it still begs the question as to how they (bright data) get access, but OpenAI/Amazon having access to tunnel through residential proxies doesn't necessarily mean they directly purchased access to compromised machines (which, IIRC, was one of the theories you two had mentioned).
       
 (DIR) Post #AszZzq5tI853RZ8DnU by jimsalter@fosstodon.org
       2025-04-11T20:16:57Z
       
       0 likes, 0 repeats
       
       @elrey741 @durchaus the phrase "first against the wall, when the revolution comes" isn't used often enough.
       
 (DIR) Post #At3f3iuFEaSUiECAt6 by jornfranke@mastodon.online
       2025-04-13T19:32:29Z
       
       0 likes, 0 repeats
       
       @jimsalter In any case: always put logs on a separate partition (and rotate them), so that logging cannot bring down the server. (Of course this does not change that everyone is attacked by the AI scrapers.)
       
 (DIR) Post #At5tPBoajayO7wbKL2 by jimsalter@fosstodon.org
       2025-04-14T21:22:41Z
       
       0 likes, 0 repeats
       
       @jornfranke the logs WERE being rotated, and compressed at time of rotation. There's only so much you can do about a site unexpectedly taking multiple orders of magnitude more traffic than it's ever received in twenty years of continuous operation.