Post AzfDAbqVUFIW5nGQuO by eric_herman@mas.to
 (DIR) Post #AzfDAaOorhYXbdIo76 by fabio@manganiello.eu
       2025-10-25T19:34:20.474536Z
       
       0 likes, 0 repeats
       
       The current #AI age will forever be known as one of the darkest pages in the history of technology. Those huge models are trained by stealing. Stealing books, stealing posts, stealing code, stealing music, stealing private pictures, stealing everything they can without remorse.

       Running a Forgejo instance myself, I've been flooded with bots too. And yes, I've also noticed that those bots took the effort to bypass the Anubis checks. Btw, they download VERY heavy pages too (like git blame and git diff) without bothering to throttle their requests or respect robots.txt. I'm basically running the CPU on my server at 100% just to let some greedy guys with way more resources than us exploit us to train their AI models.

       Not only that. I also run a Wikipedia frontend (Wikiless), a YouTube frontend (Invidious), an X frontend (Nitter) and a Reddit frontend (Redlib). All of those have been suspended at least once in the past months because of excessive requests. And guess why? Just two weeks ago I had to make my Invidious instance accessible from my VPN only, because somebody on the Alibaba network flooded it for days with 25 req/sec to random YouTube videos (I guess DeepSeek needs multimedia to train a new model?)

       I believe that grounds for lawsuits against such abuses must be established, as well as commercial deals that allow both parties to profit if they want. But the current state of things isn't sustainable, and it's hitting small self-hosting enthusiasts like me the most.

       https://social.anoxinon.de/@Codeberg/115435661014427222
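
As a rough illustration of the per-client throttling these crawlers skip (and that an operator can enforce on their own side), here is a minimal token-bucket sketch in Python. It is only a sketch under assumed parameters: the 5 req/sec rate, the burst size, and the sample client IP are invented for illustration, and this is not the actual rate-limiting logic of Forgejo, Anubis, Invidious, or any other service mentioned in the thread.

# Minimal per-IP token-bucket rate limiter (illustrative sketch; the rate,
# burst size, and client IP below are made-up parameters, not the logic of
# any service mentioned in this thread).
import time
from collections import defaultdict


class TokenBucket:
    """Allow roughly `rate` requests per second per client, with short bursts."""

    def __init__(self, rate: float = 5.0, burst: float = 10.0):
        self.rate = rate                              # tokens refilled per second
        self.burst = burst                            # maximum tokens per client
        self.tokens = defaultdict(lambda: burst)      # current tokens per client
        self.last_seen = defaultdict(time.monotonic)  # last request time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill the client's bucket in proportion to the time since its last request.
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False  # over the limit: answer with HTTP 429, drop, or tarpit


if __name__ == "__main__":
    limiter = TokenBucket(rate=5.0, burst=10.0)
    # Simulate a client firing 25 requests back to back, as in the flood described above.
    allowed = sum(limiter.allow("203.0.113.7") for _ in range(25))
    print(f"{allowed} of 25 back-to-back requests allowed")

Against a sustained 25 req/sec flood like the one described above, a limiter of this kind would serve the initial burst and then reject roughly four out of five requests, at the cost of keeping a little state per client IP.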
       
 (DIR) Post #AzfDAbqVUFIW5nGQuO by eric_herman@mas.to
       2025-10-27T13:28:23Z
       
       0 likes, 0 repeats
       
       @fabio Indeed, it is hitting small self-hosting enthusiasts the most. Sure, we can take some measures now, and they will probably work for a time, but the offenders already ignore robots.txt, so it seems likely they'll work around any other technical measures we take to keep them out. This invites an "arms race" in which we may not be able to stay ahead of the well-funded giants. Therefore, I think you're right: we need to step back and consider what legal framework is needed.
       
 (DIR) Post #AzfDAcyLIb9xaNwClE by sposadelvento@mastodon.uno
       2025-10-28T08:35:16Z
       
       0 likes, 0 repeats
       
       @eric_herman @fabio I'm not sure about that. Small self-hosting enthusiasts are many, and they can quickly apply changes and share their findings across the fediverse. I think that sharing and applying technical measures can really help build a barrier against the bots here.
       
 (DIR) Post #AzfDAdpA8DPWECeO5g by fabio@manganiello.eu
       2025-10-28T08:52:25.871245Z
       
       1 like, 0 repeats
       
       @sposadelvento @eric_herman there’s also a limit to how much time we can/want to spend keeping the crawlers at bay instead of doing more productive things. And at least in my case there’s also the variable of how many crawler-juicy services we run (not only Forgejo, but also Invidious, Redlib, Nitter, Wikiless etc.)

       Sure, we can find ways to stop them, we’ve always done it (Anubis is just the latest example), but let’s remember that behind these things there are often large companies with more resources than us. For each countermeasure we find, they’ll work on bypassing it, like Codeberg just reported.

       This is literally like piracy, but the other way around - those with the broadest shoulders are flooding and squeezing every bit they can out of self-hosting enthusiasts like me running mini-PCs in their utility room.
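
For readers unfamiliar with the Anubis checks mentioned above: tools of that kind typically gate pages behind a small proof-of-work challenge that a visitor's browser solves in the background, so bulk scraping becomes expensive. The toy sketch below only illustrates the idea; the hash construction, the 16-bit difficulty, and the challenge string are assumptions for illustration, not Anubis's actual protocol or parameters.

# Toy proof-of-work gate in the spirit of Anubis-style challenges
# (illustrative only; not Anubis's actual protocol or parameters).
import hashlib
import itertools

DIFFICULTY_BITS = 16  # low example difficulty; real gates tune this


def meets_target(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """True if sha256(challenge:nonce) starts with `bits` zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return (int.from_bytes(digest, "big") >> (256 - bits)) == 0


def solve(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Brute-force search a legitimate client performs once per challenge."""
    for nonce in itertools.count():
        if meets_target(challenge, nonce, bits):
            return nonce


if __name__ == "__main__":
    nonce = solve("example-challenge")               # costs the client some CPU time
    print(meets_target("example-challenge", nonce))  # server-side check is one hash

The asymmetry is the point: verifying a nonce costs the server one hash, while finding one takes on average 2^16 hashes at this example difficulty. That is negligible for a human visitor loading a page, but it adds up for a crawler fetching millions of pages, unless the crawler invests the effort (and compute) to solve or bypass the challenges, which is exactly what the posts above report happening.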